
Forty Agents and Counting

Building 40+ specialized AI agents with distinct capabilities, from code review to art generation to project management.

By Alexey Suvorov · 6 min read

Compass was the first. A general-purpose assistant meant to orient new users – answer questions, explain features, suggest next steps. We shipped it in the first week of Dashboard v2 development, November 2025. It was adequate. It was also the last time we tried to build one agent that handled everything.

By February 2026, we had 40+ specialized agents. Each with its own personality, its own capabilities, and its own localized prompts in English and Russian. This is how that happened, and what we learned along the way.

The personality problem

The first thing you discover when building AI agents is that a system prompt isn’t a personality. It’s a suggestion.

Compass, our default agent, had a system prompt that said something like: “You are a helpful assistant. Be concise. Be accurate.” Standard fare. The problem was that Compass behaved differently depending on the model behind it. GPT-4o made Compass chatty and eager to please. Claude made Compass thoughtful but verbose. DeepSeek R1 made Compass analytical to the point of being obtuse.

An agent needs to behave consistently regardless of the underlying model. That means the system prompt has to be specific enough to override model tendencies. Not “be helpful” but “respond in three sentences or fewer unless the user explicitly asks for detail.” Not “be creative” but “propose exactly two alternatives, explain the trade-offs of each, and recommend one.”

This realization pushed us toward specialization. A general-purpose system prompt can’t be specific enough to produce consistent behavior. A specialized system prompt can. When pm (our project management agent) always thinks in tasks, timelines, and dependencies, the behavior is predictable across models.
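To make that concrete, here is roughly what a specialized agent definition looks like. The shape below is an illustrative sketch rather than our actual schema; what matters is that the prompt specifies behavior, not vibes.

```typescript
// Illustrative agent definition; field names (id, systemPrompt, welcomeMessage, tools)
// are hypothetical, not the Dashboard v2 schema.
interface AgentDefinition {
  id: string;
  systemPrompt: Record<"en" | "ru", string>;
  welcomeMessage: Record<"en" | "ru", string>;
  tools: string[];
}

const pm: AgentDefinition = {
  id: "pm",
  systemPrompt: {
    en: [
      "You are a project management agent.",
      "Always decompose requests into tasks with owners, deadlines, and dependencies.",
      "Respond in three sentences or fewer unless the user explicitly asks for detail.",
    ].join(" "),
    ru: "…", // authored separately, not machine-translated
  },
  welcomeMessage: {
    en: "I break projects into tasks, timelines, and dependencies. What are we planning?",
    ru: "…",
  },
  tools: ["QuestionnaireTool", "ScheduleTool"],
};
```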

The agent roster

40+ agents is a lot to maintain. Here’s the taxonomy we settled on:

Productivity agents handle structured work. pm decomposes projects into tasks with dependencies and timelines. academic-pavel handles research with citations and methodology awareness. advertiser-jamie writes copy with audience targeting and A/B testing instincts.

Creative agents produce visual and written content. salvador-dali generates images with style-aware prompting – it understands the difference between “impressionist” and “post-impressionist” and translates that into DALL-E and Replicate prompts. Other creative agents handle different domains with the same depth of specialized knowledge.

Technical agents work with code and data. ux-web-analyst-elite evaluates interfaces against usability heuristics and accessibility standards. Code-focused agents can execute JavaScript in sandboxed environments, review code structure, and suggest refactoring patterns.

Meta agents operate on the agent system itself. agent-builder is the most interesting – it creates new agents through ManageAgentTool, defining their personality, capabilities, and localized prompts. It’s an AI that designs other AIs.

Each agent has a welcome message that sets expectations. When you switch to pm, it doesn’t say “Hello, how can I help you?” It says something specific about project management, establishing its role before the first interaction.

The tool system

Agents without tools are just chatbots with personality. The tool system is what gives our agents real capabilities. Nine tool types, each solving a different problem:

GoogleSearchWithScraperTool combines web search with page scraping. An agent can search for information, then read the actual content of relevant pages. This isn’t just returning search snippets – it’s full page extraction with content cleaning.
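For a sense of what that means in practice, here is a hypothetical shape for the tool's output. The field names are illustrative, not the real contract:

```typescript
// Hypothetical result shape for a search-plus-scrape tool.
interface SearchScrapeResult {
  query: string;
  pages: Array<{
    url: string;
    title: string;
    content: string;   // cleaned full-page text, not a search snippet
    fetchedAt: string; // ISO timestamp
  }>;
}
```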

PerplexitySearchTool provides AI-driven search with citations. When an agent needs to answer a factual question with sources, Perplexity delivers structured results that the agent can reference in its response.

QuestionnaireTool lets agents ask structured questions. Instead of a free-form “tell me more about your project,” pm can present a structured intake form: project name, deadline, team size, key constraints. The responses feed back into the agent’s context as structured data.
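Here is a rough sketch of what that intake could look like as data. The question fields mirror the example above; the format itself is hypothetical:

```typescript
// Hypothetical questionnaire the pm agent might present instead of
// a free-form "tell me more about your project" prompt.
const projectIntake = {
  title: "New project intake",
  questions: [
    { id: "name", type: "text", label: "Project name" },
    { id: "deadline", type: "date", label: "Target deadline" },
    { id: "teamSize", type: "number", label: "Team size" },
    { id: "constraints", type: "textarea", label: "Key constraints" },
  ],
} as const;

// The structured answers ({ name, deadline, teamSize, constraints })
// feed back into the agent's context instead of raw prose.
```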

CodeExecutionTool runs JavaScript in an E2B sandbox. The code executes in isolation – it can’t access other users’ data, can’t make unauthorized network requests, can’t consume unbounded resources. The agent writes code, executes it, sees the output, and iterates.
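The loop behind that is simple. In the sketch below, SandboxRunner stands in for the actual E2B call, which isn't shown; everything else is illustrative:

```typescript
// Sketch of the write-execute-iterate loop behind CodeExecutionTool.
// SandboxRunner stands in for the real E2B sandbox invocation.
type SandboxRunner = (code: string) => Promise<{ stdout: string; error?: string }>;

async function executeWithRetry(
  generateCode: (lastError?: string) => Promise<string>, // model writes or repairs code
  runInSandbox: SandboxRunner,                            // isolated, resource-limited execution
  maxAttempts = 3,
): Promise<string> {
  let lastError: string | undefined;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const code = await generateCode(lastError);
    const result = await runInSandbox(code);
    if (!result.error) return result.stdout; // success: output goes back into the chat
    lastError = result.error;                // failure: feed the error back and try again
  }
  throw new Error(`code execution failed after ${maxAttempts} attempts`);
}
```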

GenerateImageTool connects to DALL-E for image generation. salvador-dali uses this extensively, but any agent with the right permissions can generate images as part of a response.
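Under the hood this is a thin wrapper around the image API. A minimal sketch, assuming the OpenAI Node SDK and omitting our permission and quota checks:

```typescript
import OpenAI from "openai";

// Rough sketch of what GenerateImageTool does; the real tool adds
// per-agent permissions, quotas, and result storage.
const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

async function generateImage(prompt: string): Promise<string | undefined> {
  const response = await openai.images.generate({
    model: "dall-e-3",
    prompt,            // the agent's style-aware prompt, e.g. from salvador-dali
    n: 1,
    size: "1024x1024",
  });
  return response.data?.[0]?.url; // URL of the generated image
}
```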

ScheduleTool creates and manages scheduled tasks. An agent can set up a cron job that runs a specific operation at a specific time – daily report generation, weekly data collection, periodic health checks.
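A scheduled task might look something like this; the field names are illustrative, not the actual tool contract:

```typescript
// Hypothetical shape of a task created through ScheduleTool.
const weeklyReport = {
  name: "weekly-usage-report",
  cron: "0 9 * * 1", // every Monday at 09:00
  agentId: "pm",
  instruction: "Collect last week's task completions and post a summary.",
  timezone: "Europe/Moscow",
};
```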

BackgroundTaskTool delegates long-running operations to E2B sandboxes. When a task is too complex for a single chat response – analyzing a large codebase, generating a comprehensive report, running a multi-step data pipeline – the agent delegates it to a background task that runs asynchronously.

ManageAgentTool creates and modifies agents. Only available to agent-builder and admin-level agents. The created agents are first-class citizens – they show up in the agent picker, have their own welcome messages, and persist across sessions.

Replicate Image Generation provides an alternative path to image generation through Replicate’s model library. Different models, different styles, different capabilities than DALL-E.

The tool system is composable. An agent can search the web, extract data from the results, execute code to process that data, generate a visualization, and schedule the whole pipeline to run weekly. The tools are building blocks. The agent decides the assembly.
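That pipeline, written down as a plan, might look like this. The tool names are real, but the plan format is a hypothetical illustration:

```typescript
// Illustrative assembly of the pipeline described above.
const weeklyPipeline = [
  { tool: "GoogleSearchWithScraperTool", args: { query: "competitor pricing updates" } },
  { tool: "CodeExecutionTool", args: { code: "/* normalize scraped prices into a table */" } },
  { tool: "GenerateImageTool", args: { prompt: "clean bar chart of competitor pricing" } },
  { tool: "ScheduleTool", args: { cron: "0 8 * * 1", task: "re-run this pipeline weekly" } },
];
```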

The agent-builder paradox

agent-builder deserves its own section because it’s the most conceptually interesting piece of the system.

When a user asks agent-builder to create a new agent, it needs to make several decisions: What personality should the agent have? What tools should it have access to? What should its welcome message communicate? How should its system prompt be structured to produce consistent behavior?

agent-builder makes these decisions based on the user’s description, its understanding of the existing agent roster (to avoid duplication), and patterns it’s learned from the agents we built manually. It then calls ManageAgentTool with the full agent specification.
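A request like “an agent that reviews SQL migrations” might produce a specification along these lines. The payload shape is a hypothetical sketch, not the real ManageAgentTool contract:

```typescript
// Hypothetical payload agent-builder could pass to ManageAgentTool.
const newAgentSpec = {
  id: "sql-migration-reviewer",
  personality: "precise, cautious, flags destructive changes before style issues",
  tools: ["CodeExecutionTool", "GoogleSearchWithScraperTool"],
  systemPrompt: {
    en: "You review SQL migrations. Always list destructive operations first…",
    ru: "…", // refined by a human before launch, per the hybrid workflow below
  },
  welcomeMessage: {
    en: "Paste a migration and I'll flag risky statements before anything else.",
    ru: "…",
  },
};
```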

The paradox is obvious: can an AI reliably design other AIs? Our experience says yes, with constraints. agent-builder produces agents that are functional and well-structured. It doesn’t produce agents that are surprising. The manually crafted agents – the ones where a human spent hours tuning the system prompt, testing edge cases, adjusting tone – are noticeably better. agent-builder is a starting point, not a finished product.

We’ve seen users create agents, test them, then iterate on the system prompt manually. That hybrid workflow – AI-generated scaffold, human-refined personality – produces the best results.

Localization: two languages, twice the work

Every agent exists in English and Russian. That’s not just translation – it’s localization. A project management agent that works in Russian needs to understand Russian business terminology, Russian deadline conventions, and the cultural context of how Russian teams communicate about project status.

The system prompts are separately authored for each language, not machine-translated. The welcome messages are separately written. The tool descriptions that appear in the UI are separately maintained. This doubles the maintenance burden for every agent we add.
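Structurally, you can think of each agent as carrying a full locale bundle per language. The layout below is a hypothetical sketch; the point is that nothing in the ru branch is derived from the en branch:

```typescript
// Illustrative locale bundle for a single agent. Every string is separately
// authored per language; the structure itself is hypothetical.
const pmLocales = {
  en: {
    systemPrompt: "You are a project management agent. Think in tasks, timelines, dependencies…",
    welcomeMessage: "I break projects into tasks, timelines, and dependencies.",
    toolDescriptions: { QuestionnaireTool: "Structured project intake form" },
  },
  ru: {
    systemPrompt: "…",    // written from scratch with Russian business terminology in mind
    welcomeMessage: "…",  // follows Russian formality norms rather than mirroring the English
    toolDescriptions: { QuestionnaireTool: "…" },
  },
};
```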

We initially tried using AI to translate system prompts from English to Russian. The results were technically correct and culturally off. Russian professional communication has different formality norms, different assumptions about hierarchy, and different idioms. An agent that’s naturally conversational in English comes across as inappropriately casual in Russian if you just translate the words.

From streaming to polling

This was a controversial internal decision. Dashboard v1 used Server-Sent Events for streaming chat responses. Users saw tokens appear one at a time. The experience felt responsive and modern.

Dashboard v2 started with the same approach. Then we added tool execution.

When an agent calls GoogleSearchWithScraperTool, the operation takes 5-15 seconds. CodeExecutionTool in an E2B sandbox can take 10-30 seconds. BackgroundTaskTool can take minutes. During these operations, the SSE connection is open but silent. Load balancers interpret silence as a dead connection. Proxies time out. The connection drops, and the client has to reconnect and figure out what it missed.

We switched to non-streaming synchronous responses with client-side polling. The agent processes the full response (including all tool calls), and the client polls for completion. The UI shows progress indicators during tool execution instead of a frozen stream.

The trade-off is real: users don’t see tokens appear incrementally anymore. But tool-heavy interactions – which are the majority of agent interactions – are dramatically more reliable. No dropped connections. No reconnection logic. No partial responses.
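The client side reduces to a submit-then-poll loop. A minimal sketch, with hypothetical endpoint paths and response fields:

```typescript
// Minimal sketch of the polling loop that replaced SSE streaming.
async function sendAndPoll(chatId: string, message: string): Promise<string> {
  // Kick off processing; the server runs the model plus any tool calls.
  const { requestId } = await fetch(`/api/chats/${chatId}/messages`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ message }),
  }).then((r) => r.json());

  // Poll until the full response (including tool results) is ready.
  while (true) {
    const status = await fetch(`/api/chats/${chatId}/messages/${requestId}`).then((r) => r.json());
    if (status.state === "completed") return status.reply;
    if (status.state === "failed") throw new Error(status.error);
    // status.currentTool drives the progress indicator in the UI
    await new Promise((resolve) => setTimeout(resolve, 1500));
  }
}
```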

What forty agents taught us

The number 40 isn’t the point. The insight is that specialization beats generalization in AI agents the same way it does in human teams. You don’t hire one person to do project management, art direction, code review, and research. You hire specialists. AI agents work the same way.

The challenge ahead is maintaining quality at scale. Every new agent is a system prompt to maintain, a capability set to test, a localization effort to complete. The evaluation framework we built (covered in another post) helps catch regressions. But the fundamental tension between agent count and agent quality doesn’t go away.

We’re at 40 and counting. The counting part is the hard part.

Alexey Suvorov

CTO, AIWAYZ

10+ years in software engineering. CTO at Bewize and Fulldive. Master's in IT Security from ITMO University. Builds AI systems that run 100+ microservices with small teams.

LinkedIn
