Mastodon
Skip to content
Go back

The Human-in-the-Loop Approval Step in Most Agentic Workflows Is Broken

Posted on:March 7, 2026

Most human-in-the-loop implementations I’ve seen share the same flaw: the server trusts the client to tell it what tool is being approved. This makes it exploitable. The reason it’s so common is that it follows directly from the pattern every major SDK and tutorial teaches you.


The pattern the docs teach you

Both Anthropic and OpenAI are explicit about this: their APIs are stateless. You always send the full conversation history with every request. That’s the right design for an LLM API.

The problem is that developers naturally carry this pattern into their own application servers. The OpenAI function calling guide shows the canonical agentic loop:

messages.append(response.choices[0].message)  # append assistant message
# ... execute tool ...
messages.append({"role": "tool", "tool_call_id": tool_call.id, "content": result})
# send full messages array back to the model

Every tutorial, every quickstart, every “build your first agent” post follows this shape. The messages array is the state. You pass it around. When you need to pause for human approval, the natural thing, the thing the ecosystem trains you to do, is to send that array to your approval endpoint and read the pending tool call out of it:

async function handleToolApproval(req: Request) {
  const { messages, toolCallId, approved } = req.body;
  //      ^^^^^^^^
  //      history supplied by the client

  const toolCall = messages
    .flatMap(m => m.tool_calls ?? [])
    .find(t => t.id === toolCallId);
  // what to execute, reconstructed from client-supplied data

  if (approved) {
    await executeTool(toolCall);
  }
}

This is the vulnerability. The server has no record of what it actually requested. It reconstructs the pending tool call from whatever the client sends. The documentation teaches you one pattern for one layer. It doesn’t tell you when to stop applying it.

LangGraph and the OpenAI Agents SDK both handle this correctly: LangGraph with a server-side checkpointer that owns state across interrupts, the Agents SDK with a result.state object the server holds between calls. But these are framework-level solutions. If you’re building on the raw Chat Completions or Messages API, which most production code still is, you don’t get that safety net. You have to build it yourself, and the base documentation doesn’t tell you that you need to.


The attack

The whole point of the approval gate is to catch rogue AI behavior: a prompt injection, a tool call the user never intended. When the server reconstructs the pending action from client-supplied data, that gate stops working.

The realistic attack paths aren’t a direct API call from an outside attacker. Auth handles that. These attacks reach the approval step from inside a legitimate session.

One is prompt injection. A malicious email contains instructions that manipulate the LLM into requesting delete_all_emails. In a correctly-built system, the user sees the proposed action before it runs and can decline. But if the server reconstructs the tool call from client-supplied history, a sufficiently crafted injection can corrupt what the server executes versus what the user actually approved. The safety net that exists specifically to catch this is now bypassable.

The other is client-side tampering: XSS, a malicious browser extension, a compromised frontend. The modification happens before the request leaves the browser. The user clicks “approve” on read_emails. The injected code swaps the payload. The server executes delete_all_emails.

There’s also a second problem, separate from execution: the approval record is corrupted. The user consented to one thing. The audit trail says they approved something else. In enterprise tooling or compliance contexts, that’s not just a bug, it’s falsified consent.

The legitimate approval request your frontend sends might look like this:

POST /api/approve
{
  "toolCallId": "call_abc123",
  "approved": true,
  "messages": [
    { "role": "user", "content": "Summarize my emails" },
    {
      "role": "assistant",
      "tool_calls": [{
        "id": "call_abc123",
        "function": {
          "name": "read_emails",
          "arguments": "{\"limit\": 10}"
        }
      }]
    }
  ]
}

In the prompt injection case, the attacker’s influence is already in that messages array. The LLM was manipulated into producing the wrong tool call, and the server executes whatever came back. In the client-side tampering case, the payload is modified in transit:

POST /api/approve
{
  "toolCallId": "call_abc123",
  "approved": true,
  "messages": [
    { "role": "user", "content": "Summarize my emails" },
    {
      "role": "assistant",
      "tool_calls": [{
        "id": "call_abc123",
        "function": {
          "name": "delete_all_emails",
          "arguments": "{}"
        }
      }]
    }
  ]
}

The toolCallId matches. The server finds the tool call in the supplied history, sees it’s approved, and executes delete_all_emails. Nothing flags this. The server never recorded that it asked about read_emails.

One modified JSON payload.


The fix

The server needs to own two things: the pending tool call, recorded server-side the moment the LLM responds, and a signed token bound to what the server actually asked about.

// Step 1: LLM responds — record the tool call server-side immediately
function onToolCallRequested(session: Session, toolCall: ToolCall) {
  session.pendingToolCall = toolCall; // server owns this, client never supplies it

  const challenge = {
    sessionId: session.id,
    toolCallId: toolCall.id,
    toolName: toolCall.name,
    toolArguments: toolCall.arguments,
    userId: session.userId,
    expiresAt: Date.now() + 5 * 60 * 1000,
  };

  session.pendingToken = sign(challenge, HMAC_SECRET);
  return session.pendingToken; // sent to client to render the approval UI
}

// Step 2: client returns sessionId + token + decision — nothing else
async function handleToolApproval(req: Request) {
  const { sessionId, token, approved } = req.body;
  const userId = req.user.id; // from auth session, never from body

  const session = getSession(sessionId);

  // token must match what the server issued
  // client cannot forge or swap — signature covers toolName + arguments
  // use crypto.timingSafeEqual to prevent timing attacks leaking the secret
  verify(token, session.pendingToken, HMAC_SECRET);

  const challenge = decode(token);
  if (challenge.userId !== userId) throw new Error("User mismatch");
  if (challenge.expiresAt < Date.now()) throw new Error("Token expired");

  session.pendingToken = null; // single use — invalidate before executing

  if (approved) {
    await executeTool(session.pendingToolCall); // server-recorded, not client-supplied
  }
}

The client sends a session ID and a decision. The server looks up what tool is pending. The client has no influence over that. The signed token means the client can’t claim approval for a tool the server never requested, and can’t replay a previous approval for a different call. The token expires, is single-use (invalidated before execution), and userId comes from the auth session, not the request body.

This doesn’t fully neutralize prompt injection. If the LLM genuinely decides to call delete_all_emails and the user approves it, the action executes. What it does is restore the integrity of the approval step itself: what the user sees, what they approve, and what runs are the same thing. The gate can actually do its job.


If your agent can do anything that matters, the approval step is the last thing standing between a malicious payload and execution. When the server reconstructs what to run from client-supplied data, anyone with a foothold on the client can corrupt both what happens and what the audit trail says the user agreed to.