What I Learned Building AI Agents
Practical lessons from building OpenClaw and Ralph Loop — where agentic systems break and how to make them reliable.
The demo version of an AI agent is easy. You give an LLM a tool, it calls the tool, something happens, and the model narrates the result. It looks like magic. The production version is different. The moment you run an agent against real work — not a curated benchmark — you start seeing the failure modes that the demo quietly papered over. The two projects I've been building, OpenClaw and Ralph Loop, are where those lessons stopped being theoretical.
The first lesson is that tools are the weakest link, not the model. Agents fail far more often because a shell command hung, a network call returned a partial response, or an API surfaced an error shape the prompt never anticipated, than because the model "didn't understand." Every tool an agent can call needs a deterministic contract: timeouts, explicit error envelopes, idempotency where possible, and output that's small enough to fit back into context without a second summarization step. I now treat tool design as the main engineering work — the prompt is downstream of that.
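To make the idea of a deterministic tool contract concrete, here is a minimal Python sketch. It is not code from OpenClaw or Ralph Loop; the names `ToolResult`, `run_tool`, and `MAX_OUTPUT_CHARS`, the error-kind strings, and the 2000-character budget are all assumptions made for illustration. It wraps a shell command so that every call ends in the same envelope, whether the command succeeded, failed, hung, or did not exist.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import List, Optional

# Assumed output budget; tune it to whatever your context window can afford.
MAX_OUTPUT_CHARS = 2000


@dataclass
class ToolResult:
    """Hypothetical error envelope: every tool call returns this shape."""
    ok: bool
    output: str
    error_kind: Optional[str] = None  # "timeout", "not_found", "nonzero_exit"
    truncated: bool = False


def run_tool(argv: List[str], timeout_s: float = 10.0,
             max_chars: int = MAX_OUTPUT_CHARS) -> ToolResult:
    """Run a command with a hard timeout and a bounded, structured result."""
    try:
        proc = subprocess.run(argv, capture_output=True, text=True,
                              timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing is left hanging.
        return ToolResult(ok=False, output="", error_kind="timeout")
    except FileNotFoundError:
        return ToolResult(ok=False, output="", error_kind="not_found")

    # On failure, stderr is usually the more useful thing to show the model.
    raw = proc.stdout if proc.returncode == 0 else proc.stderr
    truncated = len(raw) > max_chars
    output = raw[:max_chars]

    if proc.returncode != 0:
        return ToolResult(False, output, "nonzero_exit", truncated)
    return ToolResult(True, output, None, truncated)


if __name__ == "__main__":
    print(run_tool([sys.executable, "-c", "print('hello')"]))
    print(run_tool([sys.executable, "-c", "import time; time.sleep(5)"],
                   timeout_s=0.5))
```

The point of the envelope is that the prompt only ever has to describe one result shape. Idempotency is not handled here; that depends on what the individual tool does.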
The second is that prompts are brittle in ways you can't predict. A prompt that works flawlessly on ten examples will break on the eleventh because the user phrased something slightly differently, or a tool returned a field the model hadn't seen before. Defending against this isn't about writing longer prompts; it's about building fallback strategies into the agent loop itself: a second model that retries with different framing when the first stalls, deterministic checks between steps that catch malformed tool calls before they execute, and an explicit "give up cleanly" path so the agent fails loudly instead of hallucinating through a bad state. Reliability in agents comes from the scaffolding around the model, not from the model getting smarter.
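The three pieces of scaffolding described above can be sketched in a few dozen lines of Python. This is an illustration under stated assumptions, not the actual loop from either project: models are stood in for by plain callables that take a prompt and return text, and `ALLOWED_TOOLS`, `validate_tool_call`, `next_tool_call`, and `AgentGaveUp` are hypothetical names. The tool registry and the JSON call format are likewise invented for the example.

```python
import json
from typing import Callable, Dict, List, Sequence, Set

# Hypothetical registry: tool name -> exact set of argument names it accepts.
ALLOWED_TOOLS: Dict[str, Set[str]] = {
    "read_file": {"path"},
    "search": {"query"},
}


class AgentGaveUp(Exception):
    """The explicit 'give up cleanly' path: raised instead of guessing."""


def validate_tool_call(raw: str) -> dict:
    """Deterministic check that runs before any tool is executed."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(call, dict):
        raise ValueError("tool call must be a JSON object")
    tool = call.get("tool")
    if not isinstance(tool, str) or tool not in ALLOWED_TOOLS:
        raise ValueError(f"unknown tool: {tool!r}")
    args = call.get("args")
    if not isinstance(args, dict) or set(args) != ALLOWED_TOOLS[tool]:
        raise ValueError(f"wrong arguments for {tool}: {args!r}")
    return call


def next_tool_call(task: str, models: Sequence[Callable[[str], str]],
                   attempts_per_model: int = 2) -> dict:
    """Try each model in order, reframing the prompt after every rejection."""
    errors: List[str] = []
    for model in models:
        prompt = task
        for _ in range(attempts_per_model):
            raw = model(prompt)
            try:
                return validate_tool_call(raw)
            except ValueError as exc:
                errors.append(str(exc))
                # Different framing on retry: tell the model what was rejected.
                prompt = (f"{task}\n\nYour previous reply was rejected: {exc}. "
                          "Reply with a single JSON object and nothing else.")
    # Fail loudly with the full history rather than continuing in a bad state.
    raise AgentGaveUp("; ".join(errors))
```

Passing something like `[primary_model, fallback_model]` as `models` gives the second model a turn only after the first has exhausted its attempts. Because the validator is ordinary deterministic code, it can be unit-tested without calling any model at all.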