Over the past few months, I’ve been learning about and building AI agents and LLM-driven automation workflows.
I learned a lot, some of it exciting and some of it challenging, and I decided to document my learnings in this post.
## LLM Workflows and Agentic Automation Is Backend Engineering with Extra Steps
Yep! Maybe I’ll ruffle some feathers, but building AI agents is very much a backend / automation pipeline-driven process. The same basic principles apply — and that’s actually a good thing.
This means we already understand many of the pitfalls and curveballs that come with production-grade automation. The difference is: in the case of agents, a mistake can break the bank.
Apply all the validation you can.
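Here’s a minimal sketch of the kind of validation I mean, assuming Pydantic v2; the `RefundAction` schema and the amount cap are hypothetical stand-ins for whatever structured action your agent emits:

```python
from pydantic import BaseModel, ValidationError

class RefundAction(BaseModel):
    # Hypothetical schema for the structured action the model is asked to emit.
    order_id: str
    amount: float
    reason: str

def parse_action(raw_json: str) -> RefundAction | None:
    """Validate model output before anything downstream acts on it."""
    try:
        action = RefundAction.model_validate_json(raw_json)
    except ValidationError:
        return None  # reject and re-prompt rather than execute garbage
    # Schema checks alone aren't enough; layer business rules on top.
    if action.amount <= 0 or action.amount > 1000:
        return None
    return action
```

Which brings me to the next point…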
## The Hard Part Is Deciding What Not to Pass
When designing prompts and workflows, the instinct is to keep feeding more context into the model — more examples, more documents, more background. But in practice, irrelevant or noisy information often hurts more than it helps.
- Passing too much unfiltered context can cause models to drift, hallucinate, or simply waste tokens.
- Filtering, chunking, and ranking inputs before they reach the LLM is often the difference between a brittle agent and a reliable one.
- Think of it as building an information diet: curating what the model sees is just as important as how you frame the task.
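As a toy illustration of that diet, here’s a sketch that ranks candidate chunks by naive keyword overlap and keeps only what fits a context budget; in a real pipeline you’d likely swap the scorer for embeddings or a reranker:

```python
def rank_chunks(query: str, chunks: list[str], budget_chars: int = 4000) -> list[str]:
    """Keep only the most relevant chunks that fit the context budget."""
    query_terms = set(query.lower().split())

    def score(chunk: str) -> int:
        # Naive relevance: how many query terms appear in the chunk.
        return len(query_terms & set(chunk.lower().split()))

    selected, used = [], 0
    for chunk in sorted(chunks, key=score, reverse=True):
        if score(chunk) == 0:
            break  # everything after this is irrelevant noise; drop it
        if used + len(chunk) > budget_chars:
            continue  # too big for the remaining budget; try smaller chunks
        selected.append(chunk)
        used += len(chunk)
    return selected
```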
Don’t just observe your flow. Be in the loop.
## Human-in-the-Loop (HITL) Is Not Optional
When prototyping agents, it’s tempting to automate everything end-to-end. But in practice, the Human-in-the-Loop (HITL) pattern often determines whether the system produces intelligent behavior or unpredictable nonsense.
- During prototyping, humans can validate intermediate outputs, approve critical actions, or correct model drift.
- In production, HITL is still relevant when accountability, compliance, and the security of your users’ data are your responsibility.
- HITL allows the system to learn faster and fail safer.
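The simplest version of the pattern is an approval gate in front of irreversible actions. A sketch, where `RISKY_TOOLS` and `run_tool` are hypothetical stand-ins for your own tool registry:

```python
RISKY_TOOLS = {"send_email", "issue_refund", "delete_record"}  # made-up tool names

def execute_step(tool_name: str, args: dict, run_tool) -> str:
    """Pause for human approval before any irreversible tool call."""
    if tool_name in RISKY_TOOLS:
        print(f"Agent wants to call {tool_name} with {args}")
        if input("Approve? [y/N] ").strip().lower() != "y":
            return "Action rejected by reviewer; ask the agent for an alternative."
    return run_tool(tool_name, args)
```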
No smooth segue here — let’s get to the next point.
## Orchestration > Model
It’s easy to think the latest state-of-the-art model will solve everything. But I’ve seen firsthand that poor orchestration can cripple even the best models.
- Without careful workflow design, an LLM may loop endlessly, call the wrong tools, or misinterpret context.
- Decisions about memory management, tool-calling, and step ordering often matter more than which model you choose.
- In fact, a well-orchestrated system with a mid-tier model can often outperform a poorly orchestrated one with a cutting-edge LLM.
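One cheap orchestration guardrail is an explicit step budget, so a confused agent can’t loop (and bill you) forever. A minimal sketch, with `llm_step` and `is_done` standing in for your model call and stop condition:

```python
MAX_STEPS = 8  # hard ceiling on agent iterations

def run_agent(task: str, llm_step, is_done) -> str:
    """Drive the agent loop under an explicit step budget."""
    state = task
    for _ in range(MAX_STEPS):
        state = llm_step(state)  # one reason/act round-trip
        if is_done(state):
            return state
    raise RuntimeError(f"Agent hit the {MAX_STEPS}-step ceiling; aborting run.")
```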
Many of my automations failed — sometimes due to shaky orchestration, and other times because of my next point (and main frustration).
## Observability Is Not There Yet
One of the biggest ongoing challenges is observability. Unlike traditional software systems, where logs and metrics are standardized, LLM-driven agents often behave like black boxes.
- How do you debug when an agent makes a bad decision?
- How do you trace reasoning across multiple steps, tools, and context windows?
- How do you measure accuracy, latency, or “hallucination rate” in production?
Right now, I’m experimenting with tools like LangGraph and Flowise, but I still don’t have a robust diagnostic setup for my room full of agents.
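In the meantime, even hand-rolled tracing beats nothing. Here’s a sketch where each agent step is a function over a state dict; the names are illustrative, not any framework’s API:

```python
import json
import logging
import time
import uuid

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("agent-trace")

def traced(step_name: str):
    """Log latency per agent step, tied together by a run-level trace id."""
    def wrap(fn):
        def inner(state: dict) -> dict:
            trace_id = state.setdefault("trace_id", uuid.uuid4().hex[:8])
            start = time.perf_counter()
            result = fn(state)
            log.info(json.dumps({
                "trace": trace_id,
                "step": step_name,
                "latency_ms": round((time.perf_counter() - start) * 1000, 1),
            }))
            return result
        return inner
    return wrap

@traced("plan")
def plan(state: dict) -> dict:
    # ...call the model here...
    return state
```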
## TL;DR
- I’m new and still learning, but backend engineering experience helps a lot.
- Guardrails are important.
- Focus on orchestration more than the SOTA model of the day.
- Figure out observability early.
Happy Building!