Building an AI Browser That Can See, Understand, and Act on Web Pages
We built a Chromium-based browser with a built-in AI assistant that can read page content, automate user actions via natural language, record workflows, and run conditional triggers — across Windows, macOS, and Linux.
3 Platforms (Win, Mac, Linux)
3 AI Modes (Local, Cloud, Hybrid)
5 QA Input Methods
0 User Data Collected
The Challenge
A Browser Where AI Isn't a Sidebar — It's the Core
The client's vision was ambitious: a browser where AI doesn't just answer questions, it can actually see what's on the page, understand the DOM, and take action. Type "open Gmail, log in with my credentials, and read new emails" — and the browser does it.
On top of that, they wanted the AI to be flexible: users could choose between a local model running in Docker, OpenAI's API, or a hybrid of both. And everything had to be privacy-first — no user data leaves the machine unless the user explicitly opts for cloud AI.
AI Needs Page Awareness
The AI couldn't just be a chatbot. It needed access to the rendered page — visible UI, DOM structure, element IDs — to understand what the user is looking at and act on it.
Natural Language Automation
Users needed to give instructions in plain English — "click submit," "fill in this form," "search for this" — and the browser had to execute those actions reliably via Selenium WebDriver integration.
Record, Replay, and Condition
Users needed to record their actions as replayable workflows stored in natural language, with the ability to set conditional triggers like "if stock price hits X, click Y."
Privacy with Flexibility
Local AI via Docker for full privacy, cloud AI for power users, hybrid for the best of both. Plus an analytics layer for the parent company that tracks installs and updates — but never touches user data.
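One way to keep an analytics layer from touching user data is to build every payload from a fixed allowlist. The sketch below is illustrative only: the field names and the `build_telemetry_payload` helper are assumptions, not the client's actual schema.

```python
# Hypothetical allowlist: only install/update facts, nothing about browsing.
ALLOWED_FIELDS = ("event", "app_version", "platform")


def build_telemetry_payload(raw):
    """Copy only allowlisted fields so extra keys can never leak by accident."""
    missing = [field for field in ALLOWED_FIELDS if field not in raw]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    return {field: raw[field] for field in ALLOWED_FIELDS}
```

Because the payload is rebuilt from the allowlist rather than filtered by a blocklist, any new field is excluded by default until someone adds it on purpose.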
Technical Architecture
How It All Connects
The browser sits on a Chromium base with a custom layer that bridges the rendering engine to the AI backend. The AI has read access to the page DOM and can dispatch actions through the automation engine.
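The dispatch step can be pictured as a small interpreter over a structured plan. The sketch below assumes the AI backend has already turned an instruction into a list of steps; the plan format, the `Driver` protocol, `RecordingDriver`, and `run_plan` are hypothetical names for illustration. In a real integration the driver methods would wrap Selenium WebDriver calls such as `driver.get(url)` and `driver.find_element(By.ID, element_id).click()`.

```python
from typing import Protocol


class Driver(Protocol):
    """Minimal interface the dispatcher needs (an assumption for this sketch)."""

    def open(self, url: str) -> None: ...
    def click(self, element_id: str) -> None: ...
    def type_text(self, element_id: str, text: str) -> None: ...


class RecordingDriver:
    """Stand-in driver that logs calls instead of controlling a browser."""

    def __init__(self):
        self.log = []

    def open(self, url):
        self.log.append(("open", url))

    def click(self, element_id):
        self.log.append(("click", element_id))

    def type_text(self, element_id, text):
        self.log.append(("type", element_id, text))


def run_plan(driver, plan):
    """Execute each step in order; reject unknown actions instead of guessing."""
    for step in plan:
        action = step["action"]
        if action == "open":
            driver.open(step["url"])
        elif action == "click":
            driver.click(step["target"])
        elif action == "type":
            driver.type_text(step["target"], step["text"])
        else:
            raise ValueError(f"unsupported action: {action!r}")
```

Keeping the dispatcher behind a narrow interface means the same plan can be replayed against a real browser or a stub, which is useful when testing recorded workflows.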
System Architecture
Chromium Shell
Rendering engine, tab management, page context
AI Engine
Chat interface, DOM awareness, NLP processing
Automation Layer
Selenium WebDriver, action recorder, conditional triggers
— AI Backend Options —
Local AI
Docker container running ML model on user machine
Cloud AI
OpenAI API for powerful inference
Hybrid
Local for simple tasks, cloud for complex ones
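The hybrid option implies a routing decision per request. A minimal sketch follows, assuming a word-count and image heuristic for what counts as "complex"; the heuristic, the `Request` type, and `choose_backend` are illustrative assumptions, not the shipped logic.

```python
from dataclasses import dataclass


@dataclass
class Request:
    prompt: str
    has_image: bool = False


def choose_backend(request, mode, cloud_allowed, max_local_words=200):
    """Return "local" or "cloud" for a request under the selected AI mode."""
    if mode == "local":
        return "local"
    if mode == "cloud":
        if not cloud_allowed:
            raise PermissionError("cloud AI requires explicit user opt-in")
        return "cloud"
    if mode == "hybrid":
        # Assumed heuristic: images and long prompts count as complex tasks.
        is_complex = request.has_image or len(request.prompt.split()) > max_local_words
        # Without opt-in, hybrid degrades to local rather than sending data out.
        return "cloud" if is_complex and cloud_allowed else "local"
    raise ValueError(f"unknown mode: {mode!r}")
```

Falling back to the local model when the user has not opted in keeps the privacy default intact even in hybrid mode.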
Key Features
What Makes This Browser Different
Floating AI Chat with Page Awareness
A floating chat box sits at the bottom of the browser. Users type natural language commands and the AI executes them. The AI has full access to the rendered page — it sees the UI components, reads the DOM, knows element IDs, and can interact with any element on the page. Users can toggle the chat on or off, and switch between three AI modes at any time.
Local AI
Docker-hosted model, fully offline
OpenAI
Cloud-powered, GPT-level inference
Hybrid
Local first, cloud for complex tasks
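To give a feel for the page awareness behind the floating chat, the sketch below condenses HTML into a short list of interactive elements that could be handed to a model. It uses Python's standard `html.parser` on static markup; the real browser reads the rendered DOM inside Chromium, and the tag list, `InteractiveElementCollector`, and `summarize_page` are assumptions made for this example.

```python
from html.parser import HTMLParser

# Tags treated as actionable in this sketch (an assumption, not the product's list).
INTERACTIVE_TAGS = {"a", "button", "input", "select", "textarea"}


class InteractiveElementCollector(HTMLParser):
    """Collects the tag name, id, and visible text of interactive elements."""

    def __init__(self):
        super().__init__()
        self.elements = []
        self._open = None  # entry currently collecting text, if any

    def handle_starttag(self, tag, attrs):
        if tag in INTERACTIVE_TAGS:
            entry = {"tag": tag, "id": dict(attrs).get("id"), "text": ""}
            self.elements.append(entry)
            # <input> is a void element, so there is no inner text to collect.
            self._open = None if tag == "input" else entry

    def handle_data(self, data):
        if self._open is not None:
            combined = self._open["text"] + " " + data.strip()
            self._open["text"] = combined.strip()

    def handle_endtag(self, tag):
        if self._open is not None and tag == self._open["tag"]:
            self._open = None


def summarize_page(html):
    """Return a list of dicts describing the interactive elements in html."""
    collector = InteractiveElementCollector()
    collector.feed(html)
    return collector.elements
```

A compact summary like this is far smaller than the full DOM, which matters when the page has to fit into a model's context window.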
Action Recording & Conditional Playback
Users can hit record and the browser captures every action as natural language steps — "clicked the Submit button," "typed 'hello' into the search field." These recordings can be replayed later, edited, and extended with conditional logic. For example: "if the stock price on this page reaches $150, click the Buy button" or "if this element appears, show me a popup notification." The system stores workflows as human-readable scripts that can be modified and re-triggered on demand.
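A conditional trigger such as "if the stock price reaches $150" comes down to parsing a number out of an element's text and comparing it to a threshold. The sketch below shows one possible check; `parse_price` and `should_fire` are hypothetical helpers, and "reaches" is interpreted here as greater than or equal to.

```python
import re
from decimal import Decimal


def parse_price(text):
    """Return the first number found in text as a Decimal, or None if absent."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        return None
    # Drop thousands separators before converting.
    return Decimal(match.group().replace(",", ""))


def should_fire(element_text, threshold):
    """True when the parsed price has reached the threshold."""
    price = parse_price(element_text)
    return price is not None and price >= threshold
```

`Decimal` is used instead of `float` so that a displayed price of 150.00 compares exactly against a threshold of 150.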
@QA — Interactive Page Analysis Panel
Typing @QA opens a dedicated panel that can pop out as a sidebar. It gives users five ways to select page content and send it to the AI for analysis.
Text Select
Drag to select any text on the page, right-click to send it to AI for analysis, summarization, or Q&A.
Element Inspector
Hover over elements to highlight them (like the DevTools inspector). Click to capture that element's content and send it to AI.
Area Snapshot
Click and drag to select a rectangular area of the page. The selection is captured as an image and sent to AI for visual analysis.
Auto-Track Element
Select a web element to continuously send its text content to AI — useful for monitoring changing data like prices or scores.
File Upload
Upload documents, images, or any supported file directly into the QA panel for AI analysis — no need to leave the browser.
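For the Auto-Track Element method in particular, it helps to forward text only when it has actually changed, so the AI is not flooded with identical readings. A minimal sketch, assuming the browser polls the element and calls `observe` with its current text; `ElementTracker` and its `send` callback are hypothetical names.

```python
class ElementTracker:
    """Forwards an element's text only when it differs from the last value sent."""

    def __init__(self, send):
        self._send = send  # callback that delivers text to the AI
        self._last = None

    def observe(self, text):
        """Return True if the text changed and was forwarded."""
        # Collapse whitespace so layout-only changes do not count as updates.
        normalized = " ".join(text.split())
        if normalized == self._last:
            return False
        self._last = normalized
        self._send(normalized)
        return True
```

Usage: create one tracker per watched element, for example `ElementTracker(send=queue.append)`, and call `observe` on every poll.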
Smart Text Field Recognition
The browser detects every text input on any webpage and offers AI-powered actions right there — grammar check, translation, auto-complete, tone adjustment. It works across all sites without any extension or setup.
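Detecting text fields can be approximated by scanning for `<input>` elements with text-like types, `<textarea>` elements, and anything marked `contenteditable`. The sketch below works on static HTML with Python's standard `html.parser`; the set of input types, the decision to skip password fields, and the names `TextFieldFinder` and `find_text_fields` are assumptions for illustration.

```python
from html.parser import HTMLParser

# Input types treated as free text here; password fields are deliberately excluded.
TEXT_INPUT_TYPES = {"text", "search", "email", "url", "tel"}


class TextFieldFinder(HTMLParser):
    """Records (kind, id) for every editable text field it encounters."""

    def __init__(self):
        super().__init__()
        self.fields = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        kind = None
        if tag == "textarea":
            kind = "textarea"
        elif tag == "input":
            # A missing or empty type attribute defaults to "text" in HTML.
            if (attr_map.get("type") or "text").lower() in TEXT_INPUT_TYPES:
                kind = "input"
        elif "contenteditable" in attr_map:
            # A bare or empty attribute means editable; only "false" disables it.
            if (attr_map["contenteditable"] or "true").lower() != "false":
                kind = "contenteditable"
        if kind is not None:
            self.fields.append((kind, attr_map.get("id")))


def find_text_fields(html):
    """Return a list of (kind, id) tuples for text fields found in html."""
    finder = TextFieldFinder()
    finder.feed(html)
    return finder.fields
```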
Scope of Work
What We Delivered
- Built a custom Chromium-based browser for Windows, macOS, and Linux with full standard browser functionality
- Designed a floating AI chat interface with page-aware context — AI reads rendered DOM, knows element IDs, and understands visible UI
- Integrated three AI backend modes: local model via Docker, OpenAI cloud API, and a hybrid option — user-switchable at any time
- Built natural language browser automation via Selenium WebDriver — users type commands like "log into Gmail and read new emails" and the browser executes
- Implemented action recording that stores steps in natural language, with playback, editing, and conditional trigger support
- Built the @QA panel with five input modes: text selection, element inspector, area snapshot, auto-tracking, and file upload — all piped to AI
- Added smart text field detection across all websites with inline AI actions — grammar, translation, and auto-suggestions
- Built the local AI Docker setup flow — browser detects if the local model is installed, guides the user through setup, and communicates via HTTP
- Implemented install tracking and OTA update delivery for the parent company — zero access to user browsing data or personal information
- Built two monetization modes: free tier with configurable ad placements (Google AdSense or self-hosted ad server) and premium ad-free tier
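The local AI setup flow in the list above starts with a question: is the Docker-hosted model reachable over HTTP? A minimal sketch of such a probe follows. The `/health` path, the `local_model_available` name, and the injectable `opener` parameter are assumptions; the actual endpoint and port depend on how the container is configured.

```python
import http.client
import urllib.request


def local_model_available(base_url, timeout=2.0, opener=urllib.request.urlopen):
    """Return True if the local model's HTTP endpoint answers with status 200.

    `opener` is injectable so the check can be exercised without a running
    container; by default it is urllib.request.urlopen.
    """
    try:
        with opener(f"{base_url}/health", timeout=timeout) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, HTTP error status, or malformed reply:
        # treat all of these as "not installed or not running".
        return False
```

When the probe returns False, the browser can show the guided setup instead of failing silently on the first chat message.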
Results
The Impact
AI That Actually Does Things
This isn't a chatbot in a sidebar. The AI reads the page, understands the structure, and executes real actions — login, navigate, fill forms, click buttons — all from natural language.
Privacy on the User's Terms
Users who want full privacy run the local Docker model. Users who want more power use OpenAI. Nobody is forced into either choice. Page content is sent to the cloud only when the user opts into cloud AI, and the parent company never receives browsing data in any mode.
Workflows Anyone Can Build
The record-and-replay system with conditional triggers turns non-technical users into automation builders. "If this happens, do that" — in plain English, no code required.
Revenue Built In from Day One
Two-tier monetization — free with ads, premium without — gives the client a revenue model from launch, with full control over ad placement and timing.