AI Agents in Business Processes: 6 Use Cases Already Running in Production
AI agents are processing invoices, triaging support tickets, and catching supply chain anomalies at mid-market companies right now. Here are 6 production patterns, the data infrastructure behind them, and the results teams are reporting.
Most companies talking about AI agents haven't shipped one. The ones that have aren't talking about it on LinkedIn. They're too busy watching a LangGraph workflow process 2,000 invoices a day while a human team of four handles the exceptions.
That gap between the conference-talk version and the production version is where the real work lives. I've built agent workflows for document extraction, data validation, and operations automation across multiple industries over the past year. The pattern repeats everywhere: a multi-step workflow, a retrieval layer, a set of tool calls, and a human review queue for low-confidence outputs.
This post covers six use cases where AI agents run in production today. Not prototypes. Not demos. Workflows handling real business volume, with architecture patterns and measurable results.
What "AI agent" actually means in production
Quick alignment on terms before the use cases.
When I say "AI agent" in a business process context, I mean a multi-step LLM-powered workflow that receives structured or unstructured input, makes decisions at branch points, calls external tools or APIs, and produces a structured output. The agent has a defined state graph. Each node performs one task: classify, extract, validate, route. Edges between nodes are conditional. The workflow can loop back to re-extract when confidence is low, or escalate to a human on edge cases.
This is different from a chatbot. A chatbot responds to user prompts. An agent runs autonomously on incoming data, makes routing decisions without human input (most of the time), and writes results to a database or triggers a downstream process.
The common implementation pattern: LangGraph (or a similar state machine framework) orchestrates the nodes. Each node calls an LLM with a specific prompt and tool set. The state object carries context between nodes. A vector store provides retrieval-augmented generation (RAG) for domain-specific knowledge. Tool-calling lets the agent interact with APIs, databases, and file systems.
1. Document extraction and classification
A logistics company receives 800+ documents per day. Bills of lading, customs forms, packing lists, commercial invoices. Each document type has a different schema. A team of 12 people manually reads, classifies, and keys data into the ERP. Error rate: 4-6%. Turnaround: 24-48 hours.
The workflow is a 5-node LangGraph state graph:
- The OCR node receives the document image or PDF. It runs Azure Document Intelligence or Textract to extract raw text and bounding boxes. Output: structured text blocks with positional metadata.
- The classification node takes the OCR output and classifies the document type using an LLM with few-shot examples stored in a prompt template. Confidence score attached.
- The extraction node applies a type-specific extraction prompt based on the classified type. The LLM extracts fields into a typed JSON schema. For bills of lading: shipper, consignee, port of loading, port of discharge, container numbers, weights.
- The validation node cross-references extracted values against reference data in a Postgres lookup table. Does the port code exist? Does the container number match the expected format? Is the weight within plausible range for this commodity type?
- The routing node decides the output path. If confidence > 0.92 and all validations pass, the extracted data writes directly to the ERP staging table. If not, the document routes to a human review queue with the agent's extraction pre-filled and the low-confidence fields highlighted.
```python
from langgraph.graph import StateGraph, END
from typing import TypedDict, Literal

class DocState(TypedDict):
    raw_image: bytes
    ocr_text: str
    doc_type: str
    confidence: float
    extracted_fields: dict
    validation_errors: list[str]

def route_after_validation(state: DocState) -> Literal["auto_commit", "human_review"]:
    if state["confidence"] > 0.92 and len(state["validation_errors"]) == 0:
        return "auto_commit"
    return "human_review"

graph = StateGraph(DocState)
graph.add_node("ocr", ocr_node)
graph.add_node("classify", classify_node)
graph.add_node("extract", extract_node)
graph.add_node("validate", validate_node)
graph.add_node("auto_commit", commit_to_erp)
graph.add_node("human_review", send_to_review_queue)

graph.set_entry_point("ocr")
graph.add_edge("ocr", "classify")
graph.add_edge("classify", "extract")
graph.add_edge("extract", "validate")
graph.add_conditional_edges("validate", route_after_validation)
graph.add_edge("auto_commit", END)
graph.add_edge("human_review", END)
```

85% of documents now process without human intervention. The remaining 15% arrive at the reviewer pre-filled, which cuts review time from 8 minutes to 90 seconds. Total manual work dropped 85%. Error rate went from 4-6% to under 1%.
One thing I underestimated on this build: OCR quality variance between document types. Bills of lading from major shipping lines came through clean. But customs forms from smaller carriers? Scanned at odd angles, faded ink, handwritten annotations. The classification node kept misclassifying them because the OCR text was garbage in the first place. I had to add a pre-processing step with image normalization (rotation correction, contrast enhancement) before OCR. That single node added 2 weeks to the timeline.
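The contrast half of that pre-processing step is simple to sketch. Below is a minimal, hedged example of the idea using a plain NumPy min-max stretch; the real pipeline also did rotation correction (deskewing), which is omitted here, and the function name is illustrative rather than the actual implementation:

```python
import numpy as np

def normalize_contrast(page: np.ndarray) -> np.ndarray:
    """Stretch a grayscale page image to the full 0-255 range.

    Faded scans compress pixel values into a narrow band; a linear
    min-max stretch before OCR recovers much of the lost contrast.
    """
    lo, hi = int(page.min()), int(page.max())
    if hi == lo:  # blank page, nothing to stretch
        return page
    stretched = (page.astype(np.float32) - lo) * (255.0 / (hi - lo))
    return stretched.astype(np.uint8)

# A faded scan: pixel values squeezed into the 100..160 band
faded = np.tile(np.arange(100, 161, dtype=np.uint8), (4, 1))
clean = normalize_contrast(faded)
```

After the stretch, the darkest ink maps to 0 and the background to 255, which is usually enough to stop the OCR output from poisoning the classification node downstream.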
You'll need: object storage for document landing zone (S3 or Azure Blob), OCR API, vector store for few-shot examples (Pinecone or pgvector), Postgres for reference data and validation lookups, a message queue (SQS or Azure Service Bus) for burst volume.
2. Automated data quality validation
A financial services company runs 200+ dbt models nightly. Data quality checks catch issues after the fact: a test fails at 4 AM, an engineer triages it at 9 AM, and the root cause takes another 2 hours to find. By then, the morning reports have already gone out with bad numbers.
This agent sits between the dbt run and the downstream consumers. It's not replacing Great Expectations or dbt tests. It adds an LLM-powered investigation layer on top.
When a dbt test fails or a data quality metric drifts, the agent kicks off four steps:
- Collect context. Pull the failing test name, the SQL that defines it, the model lineage (upstream sources), and a sample of the rows that triggered the failure.
- Investigate upstream. Query Bronze and Silver tables to check whether the issue originated upstream. Did the source system send nulls where it never has before? Did row counts spike or drop?
- Classify the issue. The LLM uses RAG over past incident reports to classify the failure: source schema change, late-arriving data, duplicate records, upstream system outage, or business logic drift.
- Recommend action. Based on the classification, the agent writes a structured incident summary: rerun the model, exclude the affected partition, escalate to the source team, or flag as a known pattern that will self-resolve.
The agent posts this analysis to Slack within 3 minutes of the test failure. The on-call engineer wakes up to a pre-investigated incident instead of a raw alert.
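The classification-to-recommendation step reduces to a small structured mapping once the LLM has produced a label. Here is a minimal sketch under stated assumptions: the issue taxonomy and action strings are hypothetical, and the real agent derives the classification from an LLM call over the RAG corpus, not a dictionary lookup:

```python
from dataclasses import dataclass

# Hypothetical taxonomy; the real labels come from the LLM classification step.
RECOMMENDED_ACTIONS = {
    "source_schema_change": "escalate to the source team",
    "late_arriving_data": "flag as known pattern, recheck after next load",
    "duplicate_records": "exclude the affected partition and rerun",
    "upstream_outage": "escalate to the source team",
    "business_logic_drift": "open a ticket with the model owner",
}

@dataclass
class IncidentSummary:
    test_name: str
    classification: str
    evidence: str
    recommended_action: str

def summarize(test_name: str, classification: str, evidence: str) -> IncidentSummary:
    action = RECOMMENDED_ACTIONS.get(classification, "escalate to on-call")
    return IncidentSummary(test_name, classification, evidence, action)

summary = summarize(
    "not_null_orders_customer_id",
    "late_arriving_data",
    "Source row count at 60% of 90-day average; loads typically complete by 05:00",
)
```

The structured summary is what gets posted to Slack; keeping it as a typed object (rather than free text) also makes it easy to index into the incident RAG corpus later.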
Mean time to resolution dropped from 3.5 hours to 45 minutes. False escalations (where the issue self-resolves) dropped 60% because the agent learned to recognize late-arriving data patterns.
I'll be honest: the RAG corpus was the hard part here. The first version indexed raw Slack messages from the #data-incidents channel. Turns out, Slack threads are terrible retrieval documents. Too much context-switching, too many tangential replies. I had to build a pipeline that extracted structured incident summaries from those threads and indexed those instead. Night and day difference in retrieval quality.
Infra: dbt metadata API or artifact parsing, a data warehouse with query access across Bronze/Silver/Gold layers, a vector store indexed on past incident reports (this is the RAG corpus), Slack API integration, a state store (Redis or DynamoDB) for tracking open incidents.
3. Customer support triage and response drafting
A SaaS company with 4,000 customers receives 300+ support tickets per day across email and their help desk. Tier 1 agents spend 40% of their time categorizing tickets and routing them. Another 30% goes to drafting responses for known issues that already have documented answers.
Two-stage workflow. Stage one handles classification and routing. Stage two drafts the response.
Triage
The agent reads the incoming ticket text, customer metadata (plan tier, recent activity, open tickets), and product area signals. It classifies on two axes: urgency (P1 through P4) and category (billing, technical, feature request, account access, integration). Then it routes to the correct team queue.
The classification uses an LLM with tool-calling:
- `get_customer_profile(customer_id)` returns plan, MRR, tenure, recent tickets
- `get_recent_incidents()` returns currently known outages or degraded services
- `search_knowledge_base(query)` runs a RAG search over the help center and internal runbooks
If the customer is on the enterprise plan with $50K+ ARR and the issue maps to a known outage, the agent auto-escalates to P1 and tags the account team. That routing used to take 20 minutes of human judgment. Now it takes 4 seconds.
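Once the tool calls have returned, that escalation rule is a few lines of deterministic logic. A hedged sketch, with field names (`plan`, `arr`) that are illustrative rather than the real customer schema:

```python
def triage_priority(profile: dict, issue_area: str, known_outage_areas: set[str]) -> str:
    """Deterministic escalation after tool calls: enterprise + known outage -> P1."""
    enterprise_at_risk = (
        profile.get("plan") == "enterprise" and profile.get("arr", 0) >= 50_000
    )
    if enterprise_at_risk and issue_area in known_outage_areas:
        return "P1"  # auto-escalate and tag the account team
    if issue_area in known_outage_areas:
        return "P2"  # known issue, but not an at-risk account
    return "P3"

urgent = triage_priority({"plan": "enterprise", "arr": 80_000}, "integration", {"integration"})
routine = triage_priority({"plan": "starter", "arr": 5_000}, "billing", {"integration"})
```

Keeping the final escalation decision deterministic (with the LLM supplying the classification inputs) makes the routing auditable: you can replay any ticket and get the same priority.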
Response drafting
For tickets that match known issues (roughly 55% do), the agent drafts a response using RAG over the knowledge base. The draft includes specific resolution steps, links to relevant docs, and any workarounds. A human agent reviews the draft, edits if needed, and sends.
```python
# Tool-calling pattern for triage agent
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_customer_profile",
            "description": "Get customer metadata including plan tier, ARR, tenure",
            "parameters": {
                "type": "object",
                "properties": {
                    "customer_id": {"type": "string"}
                },
                "required": ["customer_id"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "search_knowledge_base",
            "description": "Search help center and internal runbooks for matching articles",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string"},
                    "top_k": {"type": "integer", "default": 5}
                },
                "required": ["query"]
            }
        }
    }
]
```

Average first-response time dropped from 4.2 hours to 35 minutes. Ticket misrouting decreased 70%. Human agents handle 40% more tickets per day because they start from a pre-drafted, context-rich response instead of a blank text box.
You'll need: help desk API (Zendesk, Intercom, or Freshdesk), vector store indexed on knowledge base articles and past resolved tickets, customer data warehouse access, webhook or polling integration for real-time ticket ingestion.
4. Supply chain anomaly detection
A manufacturing company tracks 1,200 SKUs across 8 distribution centers. Demand planning runs weekly in a spreadsheet. When a supplier misses a shipment or demand spikes unexpectedly, the operations team finds out when a warehouse runs out of stock. Stockouts cost them $180K per incident on average.
This one is a hybrid: a statistical anomaly detection model feeds signals to an LLM-powered investigation agent.
The detection layer runs every 15 minutes. A time-series model (Prophet or a simple z-score on a rolling 90-day window) monitors inventory levels, inbound shipment ETAs, and order velocity per SKU per warehouse. When a metric crosses the anomaly threshold, it fires an event.
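The z-score variant of that detection layer fits in a few lines. A minimal sketch, assuming a flat list of per-day order velocities as the rolling window; the real system runs this per SKU per warehouse:

```python
import statistics

def is_anomalous(history: list[float], latest: float, threshold: float = 3.0) -> bool:
    """Flag `latest` if it sits more than `threshold` standard deviations
    from the mean of the rolling window (e.g. 90 days of order velocity)."""
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return latest != mean
    return abs(latest - mean) / stdev > threshold

# 90 days of roughly flat daily order velocity, then a 3x spike
window = [100.0, 102.0, 98.0, 101.0, 99.0] * 18  # 90 points
spike = is_anomalous(window, 300.0)
normal = is_anomalous(window, 101.0)
```

A z-score on a rolling window is deliberately crude; it misses seasonality, which is why a model like Prophet is worth the extra complexity for SKUs with strong seasonal patterns.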
The investigation agent picks up the anomaly event and runs a 4-node workflow:
- Context gathering. Pull the anomaly details, the SKU's historical pattern, supplier lead times, and any open purchase orders from the ERP.
- Root cause hypothesis. The LLM generates 2-3 possible explanations. "Supplier X has missed the last 2 deliveries for this component." "Order velocity spiked 3x in the Southeast region, consistent with the seasonal pattern from last year but 2 weeks earlier." "Warehouse B reported a receiving delay due to dock capacity."
- Impact assessment. Calculate days-of-supply remaining at current burn rate. Cross-reference against pending orders and committed delivery dates. Flag customer orders at risk.
- Recommendation. Output a structured alert: shift inventory from Warehouse A (14 days of supply) to Warehouse B (2 days of supply), expedite PO #4472 with supplier, or place an emergency order. The alert goes to the supply chain manager with all context attached.
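The impact-assessment arithmetic in step 3 is worth making concrete. A hedged sketch of days-of-supply plus a naive transfer recommendation; the data shapes and warehouse names are illustrative, and the real agent also weighs open POs and committed delivery dates:

```python
def days_of_supply(on_hand: float, daily_burn: float) -> float:
    """How many days current stock lasts at the current burn rate."""
    return float("inf") if daily_burn <= 0 else on_hand / daily_burn

def transfer_recommendation(stocks: dict[str, tuple[float, float]], min_days: float = 5.0):
    """stocks maps warehouse -> (on_hand_units, daily_burn).
    Returns (surplus_warehouse, short_warehouse, days_remaining) tuples."""
    dos = {wh: days_of_supply(units, burn) for wh, (units, burn) in stocks.items()}
    short = [wh for wh, d in dos.items() if d < min_days]
    surplus = max(dos, key=dos.get)  # naive: pull from the deepest stock
    return [(surplus, wh, dos[wh]) for wh in short if wh != surplus]

recs = transfer_recommendation({
    "warehouse_a": (1400, 100),  # 14 days of supply
    "warehouse_b": (200, 100),   # 2 days of supply
})
```

The LLM's job is wrapping numbers like these in a hypothesis and a recommendation a human can act on, not computing them; the arithmetic stays deterministic and testable.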
Detection-to-action time went from 3-5 days (waiting for the weekly review) to 2 hours. Stockout incidents dropped 40% in the first quarter. The operations team now reviews 15-20 agent-generated alerts per day instead of scanning dashboards manually.
You'll need: a real-time or near-real-time inventory data pipeline (Kafka or a CDC stream from the ERP into a lakehouse), time-series storage (Databricks with Delta Lake works well), ERP API access for purchase orders and supplier data, a scheduling layer (Airflow or Databricks Workflows) for the detection model runs.
5. Invoice processing and three-way matching
An operations-heavy company processes 3,000+ vendor invoices per month. Each invoice must match against a purchase order and a goods receipt before payment. The AP team does this manually: open the invoice PDF, find the PO number, pull up the PO in the ERP, compare line items, check quantities and unit prices, verify the goods receipt matches. One invoice takes 12-15 minutes. Errors (duplicate payments, incorrect amounts) cost the company $200K+ per year.
The agent automates the three-way match: invoice to PO to goods receipt.
- Extraction. OCR pulls vendor name, invoice number, PO number, line items (description, quantity, unit price, total), tax, and payment terms from the PDF.
- PO lookup. Tool call to the ERP API: `get_purchase_order(po_number)`. Returns the PO line items.
- Goods receipt lookup. Tool call: `get_goods_receipts(po_number)`. Returns received quantities by line item.
- Matching logic. The agent compares each invoice line item against the PO and the goods receipt. Does the invoiced quantity match the received quantity? Does the unit price match the PO price? Are there line items on the invoice not present on the PO?
- Tolerance check. If the total variance is within the configured tolerance (typically 1-2%), the invoice is approved for payment automatically. If outside tolerance, or if any line item doesn't match, the agent routes to AP with a detailed variance report showing exactly which lines don't match and why.
The matching logic is a mix of deterministic rules and LLM reasoning. The deterministic layer handles exact matches and simple arithmetic. The LLM handles fuzzy cases: the invoice says "Stainless Steel Bolts M8x30 Grade A2" and the PO says "SS Bolt M8 30mm A2-70." An embedding similarity check plus an LLM confirmation handles those description mismatches without rejecting valid invoices.
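The embedding check in that fuzzy layer is just cosine similarity with two thresholds: above the high bar, accept automatically; in the gray zone, pay for an LLM confirmation call; below the low bar, reject. A minimal sketch with toy vectors standing in for real embeddings, and threshold values that are illustrative rather than tuned:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def descriptions_match(inv_vec, po_vec, auto_threshold=0.90, llm_threshold=0.70):
    """Above auto_threshold: accept. Between the two: defer to an LLM
    confirmation call (not shown). Below: reject as a genuine mismatch."""
    sim = cosine_similarity(inv_vec, po_vec)
    if sim >= auto_threshold:
        return "match"
    if sim >= llm_threshold:
        return "llm_confirm"
    return "mismatch"

exact = descriptions_match([1.0, 0.0], [1.0, 0.0])      # identical embeddings
close = descriptions_match([1.0, 1.0], [1.0, 0.0])      # similar, needs LLM confirm
different = descriptions_match([1.0, 0.0], [0.0, 1.0])  # unrelated items
```

The two-threshold design keeps LLM spend proportional to ambiguity: exact and near-exact matches never trigger a model call.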
72% of invoices auto-approved without human touch. Processing time per invoice dropped from 14 minutes to 90 seconds (for the 28% that need review). Duplicate payment errors dropped to near zero because the agent cross-references all invoices for the same PO and flags re-submissions.
Infra: OCR API, ERP API access (for PO and goods receipt data), an embedding model for description matching, a workflow orchestrator, a database for match results and audit trail. The audit trail is non-negotiable for AP workflows. Every decision the agent makes must be logged with the inputs, the confidence score, and the reason.
6. Contract review and clause extraction
A company's legal team reviews 50+ vendor contracts per month. Each contract gets checked for non-standard terms: liability caps, indemnification clauses, auto-renewal, termination notice periods, data processing addendums. A junior associate spends 2-3 hours per contract on first-pass review. The backlog never shrinks.
The agent does first-pass review. It doesn't replace the attorney. It hands the attorney a structured summary with flagged clauses instead of a 40-page PDF.
Parsing comes first. The contract PDF is chunked into sections using heading detection and page structure analysis. Each section becomes a document chunk with metadata (section title, page number).
Then clause extraction via RAG. For each target clause type (typically 12-15 in the review checklist), the agent runs a targeted retrieval query against the chunked contract. It pulls the most relevant chunks, then uses an LLM to extract the specific clause language and classify it as standard, non-standard, missing, or ambiguous.
Each extracted clause gets a risk score based on deviation from the company's preferred terms. The preferred terms live in a vector store as the RAG corpus. "Liability capped at 12 months of fees" is standard. "Unlimited liability" gets flagged as high risk. "No liability cap mentioned" gets flagged as missing.
Finally, the agent produces a structured output: a table with each clause type, the extracted language, the risk score, and the page reference. High-risk items float to the top.
```python
# Clause extraction with targeted RAG retrieval
clause_types = [
    "liability_cap",
    "indemnification",
    "termination_notice",
    "auto_renewal",
    "data_processing",
    "governing_law",
    "assignment",
    "force_majeure",
    "confidentiality_term",
    "ip_ownership",
    "payment_terms",
    "warranty_disclaimer",
]

async def extract_clauses(contract_chunks: list[str], clause_types: list[str]):
    results = []
    for clause_type in clause_types:
        # Retrieve relevant chunks from the contract
        relevant_chunks = await vector_store.similarity_search(
            query=f"Find the {clause_type.replace('_', ' ')} clause",
            documents=contract_chunks,
            top_k=3,
        )
        # Extract and classify using LLM
        extraction = await llm.invoke(
            system=CLAUSE_EXTRACTION_PROMPT,
            context="\n".join(relevant_chunks),
            clause_type=clause_type,
            preferred_terms=await get_preferred_terms(clause_type),
        )
        results.append(extraction)
    return results
```

First-pass review time dropped from 2.5 hours to 20 minutes per contract. The attorney now spends time on judgment calls (negotiation strategy, business risk assessment) instead of reading boilerplate. Non-standard clause detection accuracy: 94% after 3 months of feedback-loop tuning on the company's specific contract corpus.
Infra: PDF parsing and chunking pipeline, a vector store with two indices (one for the contract being reviewed, one for the company's preferred terms and past contracts), an LLM with enough context window for multi-page clause sections (Claude or GPT-4 class), a review UI where attorneys can accept, reject, or correct the agent's extractions. That correction data feeds back into the RAG corpus, which is how accuracy went from ~80% at launch to 94%.
What you need before building agents
Every use case above depends on the same foundation. Skip it, and the agent will be a good demo and a bad production system.
Clean, accessible data
The agent needs to read from your systems of record. If your ERP data is trapped in a monolith with no API, the agent can't do three-way matching. If your customer data lives in three different tools with no unified view, the triage agent will misroute tickets.
Before building an agent, ask: Can I query the data this agent needs in under 500ms? Is that data fresh enough for the use case? Is it clean enough that the agent won't waste half its LLM calls trying to parse garbage?
A lakehouse architecture (Bronze/Silver/Gold on Databricks, Snowflake, or BigQuery) solves most of this. Bronze lands raw data. Silver cleans, deduplicates, and standardizes it. Gold serves the consumption-ready tables that agents query. If your Silver layer is a mess, fix that first.
A retrieval layer (RAG infrastructure)
Five of the six use cases above use RAG. The agent doesn't hallucinate answers because it retrieves actual documents, past incidents, knowledge base articles, or contract terms before generating output.
You need: a vector store (pgvector is fine for most workloads, Pinecone or Weaviate if you need managed scale), an embedding pipeline that keeps the index fresh, and a chunking strategy that matches your document types. Bad chunking is the number one reason RAG produces irrelevant results. A 4,000-token chunk from the middle of a contract with no section metadata is useless. A 500-token chunk with the section title and page number attached is gold.
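The "500-token chunk with section title and page number" idea can be sketched in plain Python. A hedged example under stated assumptions: heading detection here is a stand-in regex for numbered headings like "1. Limitation of Liability", whereas a real pipeline uses layout analysis from the PDF parser:

```python
import re

def chunk_with_metadata(pages: list[str]) -> list[dict]:
    """Split pages into paragraph chunks, carrying the current section
    title and page number with every chunk."""
    chunks, section = [], "preamble"
    for page_no, page in enumerate(pages, start=1):
        for para in filter(None, (p.strip() for p in page.split("\n\n"))):
            heading = re.match(r"^\d+\.\s+(.+)$", para.splitlines()[0])
            if heading:
                section = heading.group(1)  # new section starts here
            chunks.append({"text": para, "section": section, "page": page_no})
    return chunks

chunks = chunk_with_metadata([
    "1. Limitation of Liability\n\nLiability is capped at twelve months of fees.",
    "2. Termination\n\nEither party may terminate with 30 days notice.",
])
```

With the metadata attached, a retrieval hit can cite "Limitation of Liability, page 1" instead of an anonymous text fragment, which is what makes the downstream extraction usable.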
Observability and human-in-the-loop
Every production agent I've built has a confidence threshold. Below that threshold, the output goes to a human. Non-negotiable.
You also need logging on every agent step: what the LLM received, what it returned, what tools it called, what the tools returned. When an invoice gets auto-approved incorrectly, you need to trace exactly why. LangSmith, Arize, or a custom logging pipeline to your lakehouse all work. What doesn't work: deploying an agent with no traceability and discovering it approved 40 duplicate invoices three weeks later.
Workflow orchestration
Agents are workflows. They need scheduling, retry logic, dead-letter queues, and concurrency controls. LangGraph handles the state graph, but you still need something to trigger runs, manage parallelism, and handle failures. Airflow, Databricks Workflows, or Prefect for batch. A message queue (SQS, Kafka, Azure Service Bus) for event-driven triggers.
Cost controls
LLM calls cost money. An agent that runs 5 LLM calls per document across 2,000 documents per day is 10,000 LLM calls. At $0.01-0.03 per call (depending on the model and token count), that's $100-300/day. Acceptable for a process that previously required 12 people. Not acceptable if your extraction prompt is poorly written and you're burning tokens on retries.
Track cost per agent run. Set alerts. Use cheaper models (Claude Haiku, GPT-4o-mini) for classification and routing. Save the expensive models for extraction and judgment steps.
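Cost-per-run tracking is a multiply-and-sum, but doing it explicitly per model tier makes the cheap-model/expensive-model split visible. A back-of-envelope sketch; the per-token prices are illustrative placeholders, not current vendor pricing:

```python
# Illustrative prices per 1K tokens, NOT real vendor pricing
PRICE_PER_1K_TOKENS = {"small_model": 0.0005, "large_model": 0.01}

def run_cost(calls: list[tuple[str, int]]) -> float:
    """calls: (model, total_tokens) for each LLM call in one agent run."""
    return sum(PRICE_PER_1K_TOKENS[model] * tokens / 1000 for model, tokens in calls)

# One document: cheap model for classify/route, expensive model for extraction
cost = run_cost([
    ("small_model", 2000),   # classification
    ("small_model", 1500),   # routing
    ("large_model", 6000),   # extraction
])
daily = cost * 2000  # 2,000 documents/day
```

Logging this per run (alongside the traces) is what lets you catch a badly written prompt burning tokens on retries before the monthly bill does.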
The pattern across all six
Every production agent I've built follows the same shape:
- Structured input arrives (document, ticket, data quality alert, anomaly event)
- The agent classifies and routes
- The agent retrieves context (RAG, API calls, database lookups)
- The agent produces structured output (extracted fields, a risk score, a recommendation)
- High-confidence outputs auto-commit. Low-confidence outputs go to humans, pre-filled.
The LLM is the reasoning engine. Everything around it (the data pipeline, the vector store, the tool integrations, the observability layer) is data engineering. Companies that try to build agents without investing in that foundation get a prototype that works on 10 documents and breaks on 1,000.
I've seen it happen twice in the past year. Teams spend 3 months building the agent logic, get a killer demo, then realize the data it needs is locked in a legacy system with no API. The agent project becomes a data integration project overnight. If you start with the data layer, the agent part takes weeks, not months.
If your team is evaluating AI agents for document processing, data quality, or operations automation and you want to skip the "impressive demo that doesn't scale" phase, I can help. I build the data infrastructure and the agent workflows as one project, because that's what they are.