The LangGraph State Bug That Cost Us 3 Weeks
LangGraph state management breaks when parallel nodes write to the same key. Covers the fix, retry logic patterns, and testing strategies for non-deterministic agents in production.
The agent worked in Jupyter. Every test passed. The LangGraph state graph ran three extraction nodes, merged results into state["extracted_data"], and returned clean structured output from GPT-4o. We shipped it.
Two days into production, the logistics client called. Half the freight invoices were returning partial extractions. Carrier name present, line items missing. Or line items present, total amount gone. No errors in the logs. No exceptions. The graph completed successfully every time.
I spent three weeks finding the root cause: two parallel nodes writing to state["extracted_data"] at the same time. One overwrote the other. LangGraph state management in production has failure modes that don't exist in notebooks, and this one was silent.
The setup that looked fine
The extraction graph had five nodes. Three ran in parallel for speed: one extracted header fields (carrier, date, invoice number), one extracted line items (description, quantity, rate), and one extracted totals (subtotal, tax, grand total). A fourth node merged the results. A fifth node validated the output against a Pydantic schema.
from langgraph.graph import StateGraph, END
from typing import TypedDict

class InvoiceState(TypedDict):
    raw_text: str
    extracted_data: dict
    validation_errors: list[str]

graph = StateGraph(InvoiceState)
graph.add_node("extract_header", extract_header)
graph.add_node("extract_lines", extract_lines)
graph.add_node("extract_totals", extract_totals)
graph.add_node("merge", merge_results)
graph.add_node("validate", validate_output)

# Parallel fan-out from START
graph.add_edge("__start__", "extract_header")
graph.add_edge("__start__", "extract_lines")
graph.add_edge("__start__", "extract_totals")

# Fan-in to merge
graph.add_edge("extract_header", "merge")
graph.add_edge("extract_lines", "merge")
graph.add_edge("extract_totals", "merge")
graph.add_edge("merge", "validate")
graph.add_edge("validate", END)

In Jupyter, this ran fine. Each node returned its piece, the merge node combined them, validation passed. The issue: each extraction node wrote directly to state["extracted_data"].
The state corruption nobody warned me about
LangGraph's state is a dictionary. When nodes run in parallel, each gets a copy of the current state, runs its function, and returns updates. LangGraph merges these updates back together.
Here's the problem. If two nodes write to the same top-level key, the last one to finish wins. There's no deep merge by default. Node A writes {"extracted_data": {"carrier": "FedEx", "date": "2026-03-01"}}. Node B writes {"extracted_data": {"line_items": [...]}}. Whichever finishes second replaces the entire extracted_data dict.
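The overwrite is plain dictionary semantics. A minimal sketch, with hypothetical node payloads standing in for the real LLM outputs:

```python
# A minimal sketch of the default merge semantics: top-level keys are
# replaced wholesale (no deep merge), so the update that lands last wins.
# The payloads below are hypothetical node outputs, not real LLM results.
state = {"extracted_data": {}}

node_a_update = {"extracted_data": {"carrier": "FedEx", "date": "2026-03-01"}}
node_b_update = {"extracted_data": {"line_items": [{"qty": 2, "rate": 150.0}]}}

# Simulate Node A finishing first, Node B finishing last
state.update(node_a_update)
state.update(node_b_update)

# Node A's carrier and date are gone; only Node B's write survived
assert "carrier" not in state["extracted_data"]
assert state["extracted_data"] == {"line_items": [{"qty": 2, "rate": 150.0}]}
```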
In Jupyter, the three "parallel" nodes effectively ran one after another on a single document, so the order was consistent: Node A always finished first, Node B second, Node C third. Because each node's copy of the state already contained the earlier writes, the last write happened to carry data from all three, and the merge node always saw complete results. It worked by coincidence of execution order.
In production, with 500 invoices hitting the graph concurrently and GPT-4o latency varying between 800ms and 4 seconds per call, the execution order was random. Sometimes the header node finished last and overwrote line items. Sometimes totals finished first and got erased by everything after it.
No errors. No exceptions. The graph completed "successfully" every time. The merge node just worked with whatever partial data was in state when it ran.
The fix: separate state keys
The solution was simple once I found it. Give each parallel node its own state key. Never let parallel nodes write to the same key.
from langgraph.graph import StateGraph, END
from typing import TypedDict

class InvoiceState(TypedDict):
    raw_text: str
    header_data: dict        # only extract_header writes here
    line_items: list[dict]   # only extract_lines writes here
    totals_data: dict        # only extract_totals writes here
    merged_result: dict      # only merge writes here
    validation_errors: list[str]

def extract_header(state: InvoiceState) -> dict:
    # Call GPT-4o for header extraction
    result = call_llm_header(state["raw_text"])
    return {"header_data": result}  # writes ONLY to header_data

def extract_lines(state: InvoiceState) -> dict:
    result = call_llm_lines(state["raw_text"])
    return {"line_items": result}  # writes ONLY to line_items

def extract_totals(state: InvoiceState) -> dict:
    result = call_llm_totals(state["raw_text"])
    return {"totals_data": result}  # writes ONLY to totals_data

def merge_results(state: InvoiceState) -> dict:
    merged = {
        **state.get("header_data", {}),
        "line_items": state.get("line_items", []),
        **state.get("totals_data", {}),
    }
    return {"merged_result": merged}

That's it. Each node writes to its own key. The merge node reads all three keys and combines them. No race condition, no overwrite, no timing dependency.
After the fix, extraction accuracy on the same 500-invoice batch went from 67% (with random partial extractions) to 96.2% (with only genuine LLM extraction errors remaining).
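There is also a middle ground I looked at later: keep a single key but attach a reducer, so LangGraph merges concurrent writes through a function instead of replacing the value. A sketch, assuming Python 3.9+ dict union and a LangGraph version that supports Annotated reducers (verify against yours):

```python
from operator import or_
from typing import Annotated, TypedDict

# Alternative to separate keys: give the shared key a reducer. LangGraph
# combines concurrent writes through the function in Annotated[...], and
# operator.or_ is dict union, so parallel writes to different fields merge
# instead of clobbering each other. Sketch only; the class name is mine.
class InvoiceStateWithReducer(TypedDict):
    raw_text: str
    extracted_data: Annotated[dict, or_]  # reducer merges parallel writes
    validation_errors: list[str]

# The reducer itself: {"carrier": ...} | {"line_items": ...} keeps both
merged = or_({"carrier": "FedEx"}, {"line_items": []})
```

For a fixed set of three writers, separate keys still read more explicitly; a reducer earns its keep when the number of parallel branches is dynamic.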
Retry logic: the second trap
With the state bug fixed, I hit the next problem. GPT-4o returns rate limit errors under load. I added a retry decorator to each extraction node.
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(stop=stop_after_attempt(3), wait=wait_exponential(min=1, max=10))
def extract_header(state: InvoiceState) -> dict:
    result = call_llm_header(state["raw_text"])
    return {"header_data": result}

This worked until I also enabled LangGraph's built-in checkpointing with graph-level replay. When a node failed after 3 retries, LangGraph replayed the entire graph from the last checkpoint. Which re-ran the nodes that had already succeeded. Which burned more API tokens and hit rate limits again, triggering more retries, triggering more replays.
The API bill for one weekend was $340 before I noticed.
When to retry at the node, when to replay the graph
Node-level retry (tenacity, backoff): use for transient errors. Rate limits, timeouts, network blips. The node tries again with the same input. Quick recovery, no wasted work.
Graph-level replay (LangGraph checkpointing): use for state corruption or partial failures that affect downstream nodes. If the merge node gets bad data because an extraction node returned garbage (not an error, just wrong output), replay from the extraction step.
The rule I follow now: retry at the node for infrastructure failures (HTTP errors, timeouts). Replay the graph for logic failures (bad output that passed extraction but failed validation). Never both at the same time on the same failure path.
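The replay side of that rule needs a way to route back. In LangGraph that's a conditional edge out of the validation node; here is a sketch of the routing function (route_after_validation and MAX_REPLAYS are my names, and the node names match the graph above):

```python
# Sketch: a conditional-edge routing function that decides, after
# validation, whether to replay extraction or finish. MAX_REPLAYS caps
# the loop so a persistently bad document can't replay forever.
MAX_REPLAYS = 2

def route_after_validation(state: dict) -> str:
    """Replay extraction on logic failures, finish otherwise."""
    if state.get("validation_errors") and state.get("retry_count", 0) < MAX_REPLAYS:
        return "extract_header"  # or a list of all three extraction nodes
    return "__end__"

# Wired in with LangGraph's conditional edge API, roughly:
# graph.add_conditional_edges("validate", route_after_validation)
```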
# Node-level: retry transient errors only
from openai import RateLimitError
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(min=1, max=10),
    retry=retry_if_exception_type((RateLimitError, TimeoutError)),
)
def extract_header(state: InvoiceState) -> dict:
    result = call_llm_header(state["raw_text"])
    return {"header_data": result}

# Graph-level: replay on validation failure, not on node errors
def validate_output(state: InvoiceState) -> dict:
    # Note: retry_count needs its own entry in the state schema
    errors = run_pydantic_validation(state["merged_result"])
    if errors and state.get("retry_count", 0) < 2:
        # Route back to extraction, not retry the same node
        return {"validation_errors": errors, "retry_count": state.get("retry_count", 0) + 1}
    return {"validation_errors": errors}

Testing agents that don't return the same thing twice
The hardest part of this project wasn't the state bug or the retry logic. It was testing.
GPT-4o doesn't return the same extraction for the same invoice twice. Field ordering changes. Whitespace differs. Occasionally the model extracts "FedEx Ground" instead of "FedEx" for the carrier name. Traditional assertEqual tests are useless.
I ended up building a three-layer test strategy:
Deterministic unit tests with mocked LLM responses. I captured 50 real GPT-4o responses during development and saved them as fixtures. Unit tests use these fixtures instead of calling the API. This tests graph wiring, state management, and merge logic without LLM variability.
Schema validation tests with real LLM calls. A smaller set of 20 invoices runs against the live API. The test doesn't check exact field values. It checks that every required field exists, matches the expected type, and falls within a plausible range (e.g., total amount greater than zero, date within the last 2 years).
Golden-set regression tests. 10 invoices with human-verified correct extractions. Run weekly. If accuracy on the golden set drops below 90%, the deploy fails. This catches model degradation, prompt drift, and changes in the OpenAI API response format.
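The first layer looks roughly like this. It uses a dependency-injected variant of the header node so the LLM call can be swapped for a captured fixture; the names and the fixture payload are illustrative, not from the production repo:

```python
from unittest.mock import Mock

# Sketch of the fixture layer: the LLM call is injected so the test can
# substitute a canned response, exercising state wiring deterministically.
# HEADER_FIXTURE stands in for one of the 50 saved GPT-4o responses.
HEADER_FIXTURE = {"carrier": "FedEx", "date": "2026-03-01", "invoice_number": "INV-1001"}

def extract_header(state: dict, call_llm) -> dict:
    result = call_llm(state["raw_text"])
    return {"header_data": result}

def test_extract_header_with_fixture():
    mock_llm = Mock(return_value=HEADER_FIXTURE)
    update = extract_header({"raw_text": "raw invoice text"}, mock_llm)
    # The node must write only its own key, with the fixture intact
    assert update == {"header_data": HEADER_FIXTURE}
    mock_llm.assert_called_once_with("raw invoice text")

test_extract_header_with_fixture()
```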
def test_extraction_schema(sample_invoice):
    """Schema test: fields exist and have correct types."""
    result = run_graph(sample_invoice)
    assert "carrier" in result and isinstance(result["carrier"], str)
    assert "line_items" in result and len(result["line_items"]) > 0
    assert "total_amount" in result and result["total_amount"] > 0
    assert "invoice_date" in result  # exists, don't check exact value

def test_golden_set_accuracy():
    """Golden set: accuracy must stay above 90%."""
    correct = 0
    for invoice, expected in GOLDEN_SET:
        result = run_graph(invoice)
        if fields_match(result, expected, tolerance=0.05):
            correct += 1
    accuracy = correct / len(GOLDEN_SET)
    assert accuracy >= 0.90, f"Golden set accuracy dropped to {accuracy:.1%}"

What I'd do differently
I'd start with separate state keys from day one. The "shared dict" pattern feels natural when you're prototyping in a notebook. Two or three nodes, small data, consistent timing. It breaks the moment you add concurrency. If every LangGraph tutorial showed separate keys for parallel nodes, this bug wouldn't exist.
I'd also set up the golden-set regression test before deploying. I built it after the client reported problems, which meant two weeks of production data with unknown quality. The test takes half a day to set up. It would have caught the state corruption on the first production run.
The retry logic mistake cost $340 and a weekend. That one I don't blame myself for. The LangGraph docs don't distinguish between node-level retry and graph-level replay clearly enough. But now I treat them as separate systems with separate failure triggers. Node retry for infrastructure. Graph replay for logic. Never stack them.
The golden-set regression pattern works for any AI system in production. I use the same approach for RAG pipeline retrieval quality, where retrieval accuracy degraded 22% over three months without anyone noticing.
If your team is shipping a LangGraph agent to production, or debugging one that works in notebooks but breaks under load, bring it to a 30-minute call. I'll look at your state graph, identify the parallel write risks, and map out a testing strategy.