AI/ML · Featured

Your AI Agent Is Only as Good as the Data Feeding It

Dirty data breaks AI agents in specific, expensive ways: wrong RAG retrieval, hallucinations from bad context, garbage extraction outputs. Here is what data readiness for AI actually looks like, with a concrete checklist.

Juliano Barbosa
Published March 10, 2026 · 11 min read
#AI Agents #Data Quality #RAG #Data Engineering #Data Validation #LLM

I watched a team demo an AI agent that answered customer questions by pulling from their internal knowledge base. The demo looked perfect. In production, it told a customer their contract renewed on January 3rd. The actual date was March 1st. The agent had retrieved a record where renewal_date was stored as 01/03/2025: the ingestion pipeline had written US records as MM/DD/YYYY and UK records as DD/MM/YYYY into the same column, so the agent had no way to tell a UK March 1st from a US January 3rd.

One date format inconsistency. That's all it took. A customer ended up on the phone with legal.

AI agents amplify your data problems at machine speed. A human analyst might notice a suspicious date and double-check. An agent serves it to the end user in 200 milliseconds with full confidence. The feedback loop that catches bad data in traditional BI, where someone squints at a dashboard and says "that looks wrong," doesn't exist when an agent answers questions on its own.

This post covers the specific ways dirty data breaks AI agents, what "data readiness for AI" means in practice, and a checklist you can run against your own pipelines before you deploy agents to production.


How dirty data breaks AI agents: four failure modes

1. Wrong RAG retrieval from duplicate and stale records

Most AI agent architectures use retrieval-augmented generation (RAG). The agent queries a vector store, pulls the top-k most similar chunks, and feeds them into the LLM as context.

Here's the problem. If your vector store contains three versions of the same document (a draft from April, a revision from June, and the final from August), the retrieval step might pull the April draft. It scored highest on cosine similarity because the user's query matched a paragraph that got removed in later versions. The agent answers with confidence. It quotes information that was deleted six months ago.

I ran into this on a client project where a product FAQ knowledge base had 4,200 chunks in the vector store. After deduplication and version filtering, 1,600 of those chunks were outdated duplicates. That's 38% of the retrieval surface returning wrong or outdated context.

The fix isn't an LLM problem. It's a data pipeline problem. You need:

  • A deduplication step before embedding (hash-based or content fingerprinting)
  • A valid_until or version field on every document, enforced at ingestion
  • A scheduled job that removes or re-embeds stale records from the vector store

# Before embedding: collapse exact-duplicate chunks, keeping the newest copy
from hashlib import sha256

def deduplicate_chunks(chunks: list[dict]) -> list[dict]:
    seen: dict[str, dict] = {}
    for chunk in chunks:
        content_hash = sha256(chunk["text"].encode()).hexdigest()
        existing = seen.get(content_hash)
        if existing is None or chunk["version_date"] > existing["version_date"]:
            seen[content_hash] = chunk
    return list(seen.values())

# After: filter by freshness before upserting to vector store
# (cutoff_date is your staleness threshold, e.g. today minus the content's shelf life)
fresh_chunks = [
    c for c in deduplicate_chunks(raw_chunks)
    if c["version_date"] >= cutoff_date
]

2. Hallucinations from inconsistent or conflicting context

LLMs don't pick one source and ignore the rest. When the retrieved context contains contradictory information, the model blends it. Sometimes it picks one. Sometimes it invents a middle ground that exists nowhere in your data.

Real scenario: a support agent pulls two knowledge base articles about the same feature. Article A says the export limit is 10,000 rows. Article B (written for the enterprise tier) says 100,000 rows. The agent tells the user the limit is 50,000 rows. That number doesn't exist in any document. The model averaged the contradiction.

Good luck debugging that after the fact. The generated answer looks plausible. The user doesn't question it. Months later, someone notices the agent has been quoting a limit that never existed.

The data fix: every record in your knowledge base needs a scope or context field (which product tier, which region, which customer segment). Your retrieval query must filter on that scope before similarity search, not after.
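The filter-before-search idea can be sketched in plain Python. This is an illustrative sketch, not a specific vector store's API: the chunk shape (`meta`, `vec`), the cosine ranking, and the `scope` dict are assumptions standing in for your store's metadata-filtered query.

```python
from math import sqrt

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def scoped_retrieve(chunks: list[dict], query_vec: list[float],
                    scope: dict, k: int = 3) -> list[dict]:
    # Hard-filter on scope metadata first; similarity ranking only ever
    # sees records from the right tier/region, so contradictory scopes
    # can never share one context window.
    in_scope = [
        c for c in chunks
        if all(c["meta"].get(key) == value for key, value in scope.items())
    ]
    ranked = sorted(in_scope, key=lambda c: cosine(c["vec"], query_vec),
                    reverse=True)
    return ranked[:k]
```

With the export-limit example above, a query scoped to `{"tier": "enterprise"}` can only ever retrieve the enterprise article, so there is no contradiction for the model to average. Most managed vector stores (Pinecone, Qdrant, pgvector with a WHERE clause) support this as a metadata filter applied inside the search.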

3. Garbage extraction from unvalidated schemas

Document extraction agents (OCR pipelines that produce structured data, invoice processors, contract parsers) are particularly fragile. The agent reads a PDF, extracts fields into a schema, and passes the structured data downstream.

When input documents don't match your expected schema, the extraction doesn't fail cleanly. It fails silently. The agent extracts a total_amount field from a line that was actually a subtotal. It maps a Canadian postal code into a US ZIP code field. It reads a table header as a data row.

I've seen this pattern across three different OCR-to-structured-data pipelines. The extraction success rate looked great at 94%. But when I audited the "successful" extractions, 12% had at least one field mapped to the wrong source value. The pipeline reported success because the field was populated. It just contained the wrong data.

The fix is schema validation after extraction, not just a check that fields are present:

from pydantic import BaseModel, field_validator
from datetime import date
 
class InvoiceExtraction(BaseModel):
    invoice_number: str
    vendor_name: str
    total_amount: float
    currency: str
    invoice_date: date
 
    @field_validator("total_amount")
    @classmethod
    def amount_must_be_positive(cls, v: float) -> float:
        if v <= 0:
            raise ValueError(f"total_amount must be positive, got {v}")
        return v
 
    @field_validator("currency")
    @classmethod
    def currency_must_be_valid(cls, v: str) -> str:
        valid = {"USD", "EUR", "GBP", "CAD", "BRL"}
        if v not in valid:
            raise ValueError(f"Unknown currency: {v}")
        return v
 
    @field_validator("invoice_number")
    @classmethod
    def invoice_number_format(cls, v: str) -> str:
        if len(v) < 3 or len(v) > 30:
            raise ValueError(f"Suspicious invoice number length: {len(v)}")
        return v

Every extraction should pass through a Pydantic model (or equivalent typed validator) before it touches your database. Fields that fail validation get routed to a human review queue, not silently accepted.
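The routing step can be a thin wrapper around the validator. A minimal sketch, assuming Pydantic v2 and a cut-down version of the invoice model above; the `Invoice` model, `route_extraction` name, and in-memory lists standing in for your database and review queue are all illustrative.

```python
from pydantic import BaseModel, ValidationError, field_validator

# Cut-down stand-in for the full InvoiceExtraction model above
class Invoice(BaseModel):
    invoice_number: str
    total_amount: float

    @field_validator("total_amount")
    @classmethod
    def amount_must_be_positive(cls, v: float) -> float:
        if v <= 0:
            raise ValueError(f"total_amount must be positive, got {v}")
        return v

def route_extraction(raw: dict, accepted: list, review_queue: list) -> None:
    """Accept valid extractions; quarantine failures with their error details
    so a human reviewer sees exactly which field failed and why."""
    try:
        accepted.append(Invoice(**raw))
    except ValidationError as exc:
        review_queue.append({"raw": raw, "errors": exc.errors()})
```

The key design choice is that a failure produces a review-queue entry carrying `exc.errors()`, not a dropped record and not a silently accepted one.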

4. Wrong tool calls from mismatched metadata

Agents that use function calling or tool use rely on metadata to decide which tool to invoke and what parameters to pass. If your API descriptions, parameter schemas, or example data are stale, the agent picks the wrong tool or passes malformed arguments.

One team had an agent that could query both a "current inventory" API and a "historical inventory" API. The description field on the current inventory tool still said "returns inventory levels" without specifying the time range. The agent defaulted to the historical endpoint for real-time inventory checks because the query mentioned a specific date ("what's the stock level today?"), and the historical endpoint's description mentioned dates.

Tool metadata is data. It needs the same version control, validation, and review process as everything else in the pipeline.
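One way to make that concrete is to keep tool specs as data in the repo and lint them in CI. The tool names, description wording, and lint rules below are illustrative assumptions, not a real function-calling schema; the point is that ambiguous descriptions fail the build instead of reaching the agent.

```python
# Tool metadata lives in version control and gets linted like code.
TOOLS = [
    {
        "name": "get_current_inventory",
        "description": "Returns inventory levels as of now (real-time). "
                       "Use for questions about today's stock.",
        "parameters": {"sku": {"type": "string", "example": "SKU-1042"}},
    },
    {
        "name": "get_historical_inventory",
        "description": "Returns inventory levels for a past date. "
                       "Use only when the user asks about a specific past date.",
        "parameters": {
            "sku": {"type": "string", "example": "SKU-1042"},
            "date": {"type": "string", "format": "YYYY-MM-DD",
                     "example": "2025-06-01"},
        },
    },
]

def lint_tool_metadata(tools: list[dict]) -> list[str]:
    """CI check: every description must state its time range, and every
    parameter must ship an example. Returns a list of problems (empty = pass)."""
    problems = []
    for tool in tools:
        desc = tool["description"].lower()
        if not any(word in desc for word in ("real-time", "now", "past", "historical")):
            problems.append(f"{tool['name']}: description doesn't state its time range")
        for pname, spec in tool["parameters"].items():
            if "example" not in spec:
                problems.append(f"{tool['name']}.{pname}: missing example")
    return problems
```

Had the inventory team run a check like this, the bare "returns inventory levels" description would have failed review before the agent ever saw it.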


What "data readiness for AI" actually means

I hear teams say "our data is pretty clean" before every AI deployment. Then I audit and find 15-30% of records have at least one quality issue that would affect agent behavior. "Pretty clean" isn't a measurable state.

Data readiness for AI agents is specific. It means five things.

Validated schemas at every boundary. Every data source that feeds your agent needs an explicit schema definition, and I don't mean just column names and types. You need constraints: allowed values, ranges, formats, nullability rules. When a record violates the schema, it gets rejected or quarantined. Not silently passed through.

Freshness SLAs that match agent response time. If your agent answers questions in real time but the data behind it refreshes every 24 hours, users get stale answers and lose trust. Every data source needs a documented freshness target, and the pipeline needs automated checks that alert when freshness degrades.

Deduplication before the agent layer. Duplicates in a traditional data warehouse cause inflated metrics. Annoying, but survivable. Duplicates in an agent's retrieval context cause contradictory answers and hallucinations. The cost is higher when an AI agent is consuming the data.

Lineage from source to agent context. When the agent gives a wrong answer, you need to trace it back: which chunk was retrieved, from which document, ingested when, transformed how. Without lineage, debugging agent failures is guesswork.
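A minimal lineage record is one JSON line per answer. The field names below are assumptions about what your chunks carry (an id, a source document id, a version); the point is that every answer links back to the exact data that produced it.

```python
import json
from datetime import datetime, timezone

def log_agent_trace(question: str, answer: str,
                    retrieved_chunks: list[dict]) -> str:
    """Serialize one trace record per answer: enough to replay, months
    later, exactly which chunks (and which versions) produced it."""
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "question": question,
        "answer": answer,
        "chunks": [
            {"chunk_id": c["id"], "doc_id": c["doc_id"], "version": c["version"]}
            for c in retrieved_chunks
        ],
    }
    return json.dumps(record)
```

When the renewal-date complaint from the opening story lands, a trace like this turns "why did it say January 3rd?" into a single log query instead of a forensic exercise.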

Quality gates in the pipeline, not after. Data quality checks need to run before data enters the vector store or the agent's accessible tables. Post-hoc monitoring catches problems after users have already received bad answers. That's too late.
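A gate can be a function that raises before the load step, so a breach blocks propagation instead of being reported afterward. This is a sketch with illustrative thresholds; real pipelines would use Great Expectations, dbt tests, or similar, but the shape is the same.

```python
def null_rate(rows: list[dict], column: str) -> float:
    """Fraction of rows where the column is null (0.0 for an empty input)."""
    return sum(1 for r in rows if r.get(column) is None) / len(rows) if rows else 0.0

def quality_gate(rows: list[dict],
                 max_null_rates: dict[str, float]) -> list[dict]:
    """Raise before the load when any column's null rate breaches its
    threshold, so bad batches never reach the agent layer."""
    for column, threshold in max_null_rates.items():
        rate = null_rate(rows, column)
        if rate > threshold:
            raise ValueError(
                f"{column}: null rate {rate:.0%} exceeds threshold {threshold:.0%}"
            )
    return rows
```

Wiring the gate between transform and load means a failing batch stops the pipeline run, which is exactly the "before, not after" property this section argues for.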


Data readiness checklist

Run this against every data source your AI agent will consume. Each item is a yes/no.

Schema validation

  • [ ] Every source has a typed schema definition (Pydantic, JSON Schema, Avro, Protobuf)
  • [ ] Schema validation runs on every record at ingestion time
  • [ ] Validation failures route to a quarantine table, not to the agent
  • [ ] Schema changes upstream trigger alerts (not silent failures or silent passes)

Freshness

  • [ ] Every data source has a documented freshness SLA (e.g., "within 15 minutes" or "daily by 06:00 UTC")
  • [ ] Automated freshness checks run on schedule and alert on breach
  • [ ] The agent's UI or API communicates data freshness to end users (a "last updated" timestamp)

Deduplication

  • [ ] Content-based deduplication runs before data enters the vector store
  • [ ] Document versioning is tracked, and only the latest version is retrievable
  • [ ] A scheduled job prunes stale or superseded records

Data quality gates

  • [ ] Null rates, cardinality, and value distributions are checked per column on each pipeline run
  • [ ] Range checks and format validation run on business-specific fields (dates, amounts, codes)
  • [ ] Quality check failures block downstream propagation to the agent layer

Lineage and traceability

  • [ ] Every record the agent can access has a source identifier and ingestion timestamp
  • [ ] Vector store chunks link back to their source document and version
  • [ ] Agent responses can be traced to the specific retrieved context (log the chunk IDs)

Metadata for tool use

  • [ ] Tool/API descriptions are version-controlled alongside the code
  • [ ] Parameter schemas include examples and constraints, not just types
  • [ ] Tool metadata is reviewed when the underlying API changes

If you checked fewer than 80% of these boxes, your data layer isn't ready for production AI agents. The agent will work in demos. It will fail in production on the edge cases your data doesn't cover.


The sequence matters: fix the data pipeline first

I talk to teams who want to skip the data work and go straight to agent development. The reasoning sounds logical: "We'll fix data quality issues as they come up." In practice, your agent goes live, users report wrong answers, and the team spends the next three months in reactive firefighting mode while trust erodes.

The better sequence:

  1. Audit your data sources. Profile every source the agent will consume. Measure null rates, duplicate rates, freshness gaps, schema violations. This takes 1-2 weeks depending on the number of sources.

  2. Build quality gates into the pipeline. Add schema validation, deduplication, and freshness checks before data reaches the agent layer. Use tools you already know: Great Expectations, dbt tests, Pydantic validators, Databricks quality rules.

  3. Set up lineage and logging. Make sure you can trace any agent answer back to the data that produced it. When (not if) the agent gives a wrong answer, you need to diagnose it in minutes, not days.

  4. Then build the agent. With clean, validated, traceable data, agent development goes faster. You spend time on prompt engineering and workflow design instead of debugging data issues disguised as model issues.
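The audit in step 1 can start as small as per-column profiling. A sketch under stated assumptions: `profile_column` is a hypothetical helper, not a specific tool's API, and real audits would also cover freshness gaps and duplicate rates.

```python
from collections import Counter

def profile_column(rows: list[dict], column: str) -> dict:
    """Quick audit stats for one column: null rate, cardinality, top values."""
    values = [r.get(column) for r in rows]
    non_null = [v for v in values if v is not None]
    counts = Counter(non_null)
    return {
        "null_rate": 1 - len(non_null) / len(values) if values else 0.0,
        "cardinality": len(counts),
        "top_values": counts.most_common(3),
    }
```

Running this over every column the agent will touch is usually enough to surface the first round of surprises (a currency column with eleven distinct spellings of "USD", a date column that is 30% null) before any agent code exists.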

This sequence adds 2-4 weeks upfront. It saves 2-4 months of post-launch debugging. Every team I've worked with that skipped step 1 ended up doing it later, under pressure, with users already losing confidence in the agent.


The honest part: most "AI failures" are data failures

After building data pipelines and AI agent workflows across 15+ projects, the pattern is consistent. When an agent gives a wrong answer, the instinct is to blame the model. Swap to GPT-4o. Fine-tune. Add more prompt engineering.

80% of the time, the problem is upstream. The model received bad context because the data pipeline let a stale record through. The extraction failed silently because nobody validated the schema. The retrieval pulled a duplicate because nobody deduplicated the vector store.

AI agents are a forcing function for data quality. They expose every data problem you've been tolerating in your warehouse, your knowledge base, your APIs. Teams that treat AI deployment as a data engineering project first and an ML project second are the ones shipping agents that actually work in production.


If your team is planning an AI agent deployment and the data layer isn't ready, or if you have agents in production giving wrong answers and you suspect the data pipeline, bring it to a 30-minute call. I'll look at your current architecture and tell you where the data gaps are and what to fix first.

Book a discovery call →

