
Agentic Workflows Are Not Conversations

Posted on: March 8, 2026


The chat interface is the right UX for users. It’s the wrong architecture for the system behind it. Most teams don’t separate the two, and the resulting problems are easy to misattribute to the model.

A developer sees the agent recommend a product the user removed from their cart five minutes ago. They file a bug against the LLM. The model hallucinated. Except it didn’t: the cart update never made it back into the message history. The agent was working from stale context. The LLM did exactly what it was told. The architecture failed to tell it the right thing.

That’s the pattern. The symptoms look like model error. The cause is structural.


App state and chat state are not the same thing

Your application has structured state: the current user, their selected items, the step they’re at in a flow, data loaded from your database. The LLM has context: a flat array of messages.

These two representations of state are not equivalent, and they diverge constantly.

When a user changes something in your UI mid-conversation, you have to get that change into the LLM’s context. The standard approaches are injecting a synthetic message ({"role": "user", "content": "The user has updated their cart"}) or reconstructing the system prompt on every call to reflect current application state.

Neither is reliable. Synthetic messages break the conversation’s coherence. Reconstructed system prompts grow unbounded as state becomes more complex. More importantly, both approaches mean your application state and your agent state are maintained separately and reconciled manually. That reconciliation is invisible to your application logic. When they drift (and they will), debugging means reading a flat message array and inferring what the agent thought the application state was at each step.
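To make the two approaches concrete, here is a minimal sketch of both. The types and field names (AppState, cartItems) are hypothetical, not from any particular framework:

```typescript
type Message = { role: "system" | "user" | "assistant"; content: string };

// Hypothetical application state; fields are illustrative.
type AppState = { userId: string; cartItems: string[] };

// Approach 1: inject a synthetic user message describing the change.
function injectCartUpdate(history: Message[], state: AppState): Message[] {
  return [
    ...history,
    {
      role: "user",
      content: `The user has updated their cart: ${state.cartItems.join(", ")}`,
    },
  ];
}

// Approach 2: rebuild the system prompt from current state on every call.
// This grows with every field of state the agent might need to see.
function buildSystemPrompt(state: AppState): Message {
  return {
    role: "system",
    content: `Current user: ${state.userId}. Cart: ${state.cartItems.join(", ")}.`,
  };
}
```

Note what both sketches leave implicit: nothing ties a given message to the state snapshot it encoded, so once state changes again there is no record of what the agent was told when.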

The chat model has no concept of structured application state. It has messages. When your application has state that matters to the agent, you’re responsible for translating between two representations that were never designed to be equivalent.

When they drift and you can’t tell why, the symptom looks like model error. The send_email call went to the wrong recipient because the agent was working from a reconstructed system prompt that didn’t reflect the user’s last update. The LLM did exactly what it was told. The write call executed on bad input. The problem was upstream, invisible in the message array.


Tool calls are not messages

This is where the flat model causes the most damage, and it’s subtle because the damage is structural rather than immediately visible.

The OpenAI and Anthropic APIs represent tool calls as message parts: an assistant message with a tool_calls field, followed by tool role messages containing results. It’s a practical serialization format. As an architectural model it erases semantics that matter enormously for how you build and operate agentic systems.
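For reference, the flat shape looks roughly like this in the OpenAI Chat Completions style (ids and payloads here are illustrative):

```typescript
// An assistant message carrying a tool call as a field, followed by a
// tool-role message carrying its result. Two entries in a flat array.
const assistantMsg = {
  role: "assistant",
  content: null,
  tool_calls: [
    {
      id: "call_abc123", // illustrative id
      type: "function",
      function: { name: "get_cart", arguments: '{"userId":"u1"}' },
    },
  ],
};

const toolMsg = {
  role: "tool",
  tool_call_id: "call_abc123", // links result to call by string id only
  content: '{"items":["sku-42"]}', // result is a JSON-encoded string
};
```

The only link between call and result is a string id, and the result itself is an opaque string. Everything the rest of this post discusses follows from that.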

There are three fundamentally different categories of tool call. The flat message model makes them all look identical.


Read calls: knowledge state is opaque

search_products, get_cart, fetch_user_profile: these observe the world but don’t change it. They seem harmless in the flat model. The problems are slower to surface.

A read call still produces state: the agent’s knowledge. Before search_products, the agent doesn’t know what matches the query. After it, it does. That transition matters, and the flat model makes it opaque.

Caching. If the user asks a follow-up that requires the same product search, there’s no clean way to detect that the result is already in history without parsing message content. The flat array gives you no typed handle on “this tool was called with these arguments and returned this result.” You re-fetch, or you write ad-hoc deduplication logic outside the agent loop.
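A typed handle makes the caching problem trivial. A hedged sketch, assuming tool arguments serialize deterministically (the helper names are hypothetical):

```typescript
// A minimal read-result cache keyed by tool name plus canonical arguments.
const readCache = new Map<string, unknown>();

function cacheKey(tool: string, args: Record<string, unknown>): string {
  // The sorted replacer array orders keys so equal args produce equal keys.
  return `${tool}:${JSON.stringify(args, Object.keys(args).sort())}`;
}

function cachedRead(
  tool: string,
  args: Record<string, unknown>,
  execute: () => unknown,
): unknown {
  const key = cacheKey(tool, args);
  if (readCache.has(key)) return readCache.get(key); // result already known
  const result = execute();
  readCache.set(key, result);
  return result;
}
```

With the flat array, this logic has to live outside the agent loop and reverse-engineer the same keys from message content.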

Context window management. As history grows, tool results become expensive tokens you need to manage. You might want to drop old search results that are no longer relevant, or summarize a long product list. With a flat array, that means string manipulation on JSON-encoded message content. There’s no principled way to identify and prune stale tool results without parsing the entire history.

Dependency tracking. Downstream tool calls often depend on upstream results. Get cart, then check inventory for each item in the cart. In the flat model that dependency is implicit in message ordering. There’s no explicit record of which tool call produced which data, or what downstream calls consumed it. When something fails mid-chain, reconstructing what the agent knew at each step requires reading the full conversation.
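An explicit record turns “what did the agent know at this step” into a query instead of a re-read of the conversation. A sketch under stated assumptions (ids, tool names, and the inputsOf helper are all hypothetical):

```typescript
// Each call records which upstream calls' results it consumed.
type CallRecord = {
  id: string;
  tool: string;
  args: Record<string, unknown>;
  result: unknown;
  dependsOn: string[]; // ids of upstream calls that fed this one
};

const calls: CallRecord[] = [
  { id: "c1", tool: "get_cart", args: {}, result: ["sku-42"], dependsOn: [] },
  {
    id: "c2",
    tool: "check_inventory",
    args: { sku: "sku-42" },
    result: { inStock: true },
    dependsOn: ["c1"], // explicit: c2 consumed c1's result
  },
];

// When a call fails, its inputs are the results of its dependencies.
function inputsOf(id: string): unknown[] {
  const call = calls.find((c) => c.id === id);
  return (call?.dependsOn ?? []).map(
    (dep) => calls.find((c) => c.id === dep)?.result,
  );
}
```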

None of these problems are catastrophic in isolation. Together, over a long-lived agentic workflow with many tool calls, they make the system increasingly difficult to reason about and maintain.


Write calls: the message array is not an audit log

send_email, place_order, delete_file: these change the world. The flat model’s problems here are more immediately serious.

When a write call executes, something happened. That event needs to be recorded independently of conversation history, not as a message in an array that the client can supply, modify, or reconstruct, but as an authoritative server-side fact.

The approval vulnerability is one symptom of not having this: when the server reconstructs the pending tool call from client-supplied message history, an attacker can swap the tool call before approval and the server has no ground truth to compare against. But the approval case is just the most exploitable instance of a broader problem.

Auditability. If a write call executes and something goes wrong, you need to know exactly what was executed, with what arguments, at what time, and in what context. That record should not live in a conversation array that can be mutated. It should be an append-only, server-owned event.

Recovery. If a write call fails partway through a multi-step workflow, you need to know which steps succeeded and which didn’t. The flat model has no vocabulary for this. You’re left inferring execution state from message content (which tool result messages are present, which are absent) rather than reading explicit success and failure events.

Replay. In a flat message array, replaying a failed workflow means re-running the entire conversation from scratch or manually constructing a modified history. With an explicit event log, replay means applying events to state in order, skipping or replacing the failed step.
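The three requirements above reduce to the same primitive: a server-owned, append-only event log. A minimal sketch (the event shape and helpers are illustrative, not a specific framework’s API):

```typescript
// Every write execution is recorded as an authoritative server-side fact,
// independent of any client-supplied conversation history.
type WriteEvent = {
  seq: number;
  tool: string;
  args: Record<string, unknown>;
  status: "succeeded" | "failed";
  at: string; // ISO timestamp
};

const log: WriteEvent[] = [];

function recordWrite(
  tool: string,
  args: Record<string, unknown>,
  ok: boolean,
): void {
  log.push({
    seq: log.length, // append-only: seq is monotonic
    tool,
    args,
    status: ok ? "succeeded" : "failed",
    at: new Date().toISOString(),
  });
}

// Replay: succeeded steps are skipped; only failed steps are retried.
function stepsToRetry(events: WriteEvent[]): WriteEvent[] {
  return events.filter((e) => e.status === "failed");
}
```

The audit question (“what executed, with what arguments, when”) is a scan of the log; the recovery question is a filter on status.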


Hybrid calls: side effects are invisible

reserve_inventory, create_draft, log_event: these change state as a side effect of reading, or are reversible writes. They’re the most underappreciated category, and the flat model gives you no way to identify them.

A reserve_inventory call looks like a read. You’re asking “is this item available?” But it also places a hold. If the workflow fails after the reservation but before the order, you have a dangling hold. In the flat model, nothing distinguishes this call from a pure read. The side effect is invisible to any logic outside the LLM loop.

The same applies to create_draft, which produces a retrievable artifact, or log_event, which writes to an audit trail. These calls have properties that matter for how you handle failures, retries, and rollbacks. The message array encodes none of them.
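Once side effects are recorded explicitly, rollback becomes a table lookup. A hedged sketch; the side-effect names and undo table are illustrative:

```typescript
// Each recorded side effect maps to a compensating action, so a failed
// workflow can release dangling holds instead of leaking them.
type SideEffect = "inventory.reserve" | "draft.create";

const compensations: Record<SideEffect, (ref: string) => string> = {
  "inventory.reserve": (ref) => `release hold ${ref}`,
  "draft.create": (ref) => `delete draft ${ref}`,
};

function rollback(effects: { kind: SideEffect; ref: string }[]): string[] {
  // Undo in reverse order of execution.
  return [...effects].reverse().map((e) => compensations[e.kind](e.ref));
}
```

In the flat model there is no effects list to iterate; the only record of the hold is a JSON string inside a tool-result message.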


What the flat model erases

The uniform tool_calls message representation is convenient precisely because it’s uniform. You don’t have to classify your tools. You don’t have to decide upfront what’s a read, what’s a write, what’s a hybrid. Every tool call goes in as a message and comes back as a message.

That convenience is the problem. The semantics that matter for operating a real agentic system (whether to cache a result, whether to require approval, whether the operation is reversible, what the agent knew at each step, what can be retried safely) have to be re-derived elsewhere, in ad-hoc application logic that’s disconnected from the agent loop and has no access to the structured information it needs.

The taxonomy maps directly to a type system.


What typed tool calls look like

```typescript
type ReadCall = {
  kind: "read";
  tool: string;
  args: Record<string, unknown>;
  result: unknown;
  cachedAt?: Date;
};

type WriteCall = {
  kind: "write";
  tool: string;
  args: Record<string, unknown>;
  executedAt: Date;
  approvedBy: string;
  result: unknown;
};

type HybridCall = {
  kind: "hybrid";
  tool: string;
  args: Record<string, unknown>;
  sideEffects: string[]; // e.g. ["inventory.reserve", "hold.created"]
  reversible: boolean;
  result: unknown;
};

type ToolCall = ReadCall | WriteCall | HybridCall;
```

With this, the properties that matter are no longer implicit in message ordering. Caching logic can branch on kind === 'read'. Audit logging can filter on kind === 'write'. Rollback logic has an explicit list of side effects to undo. None of this requires parsing JSON-encoded message content.
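Concretely, each of those branches is a one-liner on the discriminated union. A sketch (the union is redeclared here in compressed form so the example stands alone; helper names are hypothetical):

```typescript
type ToolCall =
  | { kind: "read"; tool: string; result: unknown; cachedAt?: Date }
  | { kind: "write"; tool: string; executedAt: Date; approvedBy: string }
  | { kind: "hybrid"; tool: string; sideEffects: string[]; reversible: boolean };

// Caching logic branches on the tag.
function isCacheable(call: ToolCall): boolean {
  return call.kind === "read";
}

// Audit logging filters on the tag.
function auditEntries(calls: ToolCall[]): ToolCall[] {
  return calls.filter((c) => c.kind === "write");
}

// Rollback reads an explicit side-effect list; no JSON parsing.
function undoList(call: ToolCall): string[] {
  return call.kind === "hybrid" && call.reversible ? call.sideEffects : [];
}
```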


The mismatch

The chat metaphor works because it’s uniform. Conversations are linear. Messages accumulate. The model responds.

Agentic workflows break that the moment a tool call produces knowledge state, changes an external system, or creates a side effect that needs undoing. None of that fits a flat array of messages. The architecture needs to reflect what’s actually happening, not the conversational metaphor the interface presents to the user.

The chat interface is for users. The event log is for the system.