Using n8n To Orchestrate Multiple Agents
I’ve been heads-down building a series of AI step-by-step labs, and this one might be my favorite so far: a practical, cost-savvy “mixture of experts” architectural pattern you can run with n8n and self-hosted models on Ollama.
The idea is simple. Not every prompt needs a heavyweight reasoning model. In fact, most don’t. So we put a small, fast model in front to classify the user’s request—coding, reasoning, or something else—and then hand that prompt to the right expert. That way, you keep your spend and latency down, and only bring out the big guns when you really need them.
Architecture at a glance:
- Two hosts: one for your models (Ollama) and one for your n8n app. Keeping these separate helps n8n stay snappy while the model server does the heavy lifting.
- Docker everywhere, with persistent volumes for both Ollama and n8n so nothing gets lost across restarts.
- Optional but recommended: NVIDIA GPU on the model host, configured with the NVIDIA Container Toolkit to get the most out of inference.
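Here's roughly what the model host looks like in Docker, assuming the NVIDIA Container Toolkit is already installed; the container name, volume name, and port below are just the Ollama defaults, so adjust as needed:

```bash
# Model host: Ollama in Docker with GPU access and a persistent volume
docker run -d --name ollama \
  --gpus=all \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  ollama/ollama
```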
On the model server, we spin up Ollama and pull a small set of targeted models:
- deepseek-r1:1.5b for the classifier and general chit-chat
- deepseek-r1:7b for the reasoning agent (this is your “brains-on” model)
- codellama:latest for coding tasks (Python, JSON, Node.js, iRules, etc.)
- llama3.2:3b as an alternative generalist
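With the container up, pulling the lab's models looks something like this (assuming the container is named ollama, as above):

```bash
# Pull the classifier and expert models into the Ollama container
docker exec -it ollama ollama pull deepseek-r1:1.5b
docker exec -it ollama ollama pull deepseek-r1:7b
docker exec -it ollama ollama pull codellama:latest
docker exec -it ollama ollama pull llama3.2:3b
```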
On the app server, we run n8n. Inside n8n, the flow starts with the “On Chat Message” trigger. I like to immediately send a test prompt so there’s data available in the node inspector as I build. It makes mapping inputs easier and speeds up debugging.
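For reference, a minimal Dockerized n8n on the app server looks something like this; the image, port, and volume path are the published defaults, so adapt them to your environment:

```bash
# App host: n8n in Docker with a persistent volume for workflows and credentials
docker volume create n8n_data
docker run -d --name n8n \
  -p 5678:5678 \
  -v n8n_data:/home/node/.n8n \
  docker.n8n.io/n8nio/n8n
```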
Next up is the Text Classifier node. The trick here is a tight system prompt and clear categories:
- Categories: Reasoning and Coding
- Options: When no clear match → Send to an “Other” branch
- Optional: You can allow multiple matches if you want the same prompt to hit more than one expert. I’ve tried both approaches. For certain ambiguous asks, allowing multiple can yield surprisingly strong results.
I attach deepseek-r1:1.5b to the classifier. It’s inexpensive and fast, which is exactly what you want for routing. In the System Prompt Template, I tell it:
- If a prompt explicitly asks for coding help, classify it as Coding
- If it explicitly asks for reasoning help, classify it as Reasoning
- Otherwise, pass the original chat input to a Generalist
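To sanity-check that routing behavior outside n8n, you can hit Ollama's chat API directly with the small model. The host name and the exact prompt wording below are placeholders, not the lab's verbatim template:

```bash
# Ask the small classifier model to label a prompt (host and prompts are examples)
MODEL_HOST=http://model-host:11434
curl -s "$MODEL_HOST/api/chat" -d '{
  "model": "deepseek-r1:1.5b",
  "stream": false,
  "messages": [
    {"role": "system", "content": "Classify the user prompt as exactly one of: Coding, Reasoning, Other. Reply with the single word only."},
    {"role": "user", "content": "Write a Python function that validates JSON against a schema."}
  ]
}'
```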
From there, each classifier output connects to its own AI Agent node:
- Reasoning Agent → deepseek-r1:7b
- Coding Agent → codellama:latest
- Generalist Agent (the “Other” branch) → deepseek-r1:1.5b or llama3.2:3b
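It's also worth smoke-testing each expert directly before wiring up the agents, so you know any weirdness later lives in the flow rather than the models. Again, the host and prompts here are just examples:

```bash
# Hit the coding and reasoning experts directly (MODEL_HOST as above)
curl -s "$MODEL_HOST/api/generate" -d '{
  "model": "codellama:latest",
  "stream": false,
  "prompt": "Write a Node.js snippet that reads a JSON file and prints its top-level keys."
}'
curl -s "$MODEL_HOST/api/generate" -d '{
  "model": "deepseek-r1:7b",
  "stream": false,
  "prompt": "If a job takes 3 workers 8 hours, how long does it take 4 workers? Explain briefly."
}'
```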
I enable “Retry on Fail” on the classifier and each agent. In my environment (cloud and long-lived connections), a few retries smooth out transient hiccups. It’s not a silver bullet, but it prevents a lot of unnecessary red Xs while you’re iterating.
Does this actually save money? If you’re paying per token on hosted models, absolutely. You’re deferring the expensive reasoning calls until a small model decides they’re justified. Even self-hosted, you’ll feel the difference in throughput and latency. CodeLlama crushes most code-related queries without dragging a reasoning model into it. And for general questions ("How do I make this sandwich?"), a small generalist is plenty.
A few practical notes from the build:
- Good inputs help. If you know you’re asking for code, say so. Your classifier and downstream agent will have an easier time.
- Tuning beats guessing. Spend time on the classifier’s system prompt. Small changes go a long way.
- Non-determinism is real. You’ll see variance run-to-run. Between retries, better prompts, and a firm “When no clear match” path, you can keep that variance sane.
- Bigger models, better answers. If you have the budget or hardware, plugging in something like Claude, GPT, or a higher-parameter DeepSeek will lift quality. The routing pattern stays the same.
Where to take it next:
- Wire this to Slack so an engineering channel can drop prompts and get routed answers in place.
- Add more “experts” (e.g., a data-analysis agent or an internal knowledge agent) and expand your classifier categories.
- Log token counts/latency per branch so you can actually measure savings and adjust thresholds/models over time.
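If you go the measurement route, Ollama's non-streaming responses already carry the raw numbers: eval_count, prompt_eval_count, and the duration fields (in nanoseconds). A quick way to peek at them per branch, assuming jq is installed and MODEL_HOST is set as earlier:

```bash
# Inspect token counts and latency for a single call to one branch's model
curl -s "$MODEL_HOST/api/generate" -d '{
  "model": "deepseek-r1:7b",
  "stream": false,
  "prompt": "Summarize the trade-offs of routing prompts to specialist models."
}' | jq '{model, prompt_eval_count, eval_count, total_duration, eval_duration}'
```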
This is a lab, not a production deployment, but the pattern is production-worthy with the right guardrails. Start small, measure, tune, and only scale up the heavy models where you’re seeing real business value. Let me know what you build—especially if you try multi-class routing and send prompts to more than one expert. Some of the combined answers I’ve seen are pretty great.
Here's the lab in our Git repo, if you'd like to try it out for yourself.
If video is more your thing, try this:
Thanks for building along, and I’ll see you in the next lab.