Data Provenance: The Unsexy Discipline Deciding Who Wins the AI Era

Every dataset tells a story. The question that increasingly decides lawsuits, audits, and procurement decisions is simpler than most analytics: can you prove where each number came from?

That question has a name, data provenance, and after years as an academic afterthought it is becoming the defining requirement of enterprise AI. The clearest demonstration is happening right now in American healthcare, where the ability to trace a single score back to its source evidence is separating organisations that pass billion-dollar audits from organisations that write cheques to the government.

A score with consequences

The score in question is the RAF score, short for risk adjustment factor. Every member of a private US Medicare plan has one. It compresses their recorded diagnoses into a single number that determines the monthly payment their insurer receives from the government. Tens of millions of members, hundreds of billions of dollars, all flowing through one derived metric.

Now apply the provenance question. A RAF score is the output of a pipeline: doctor writes notes, coder translates notes into diagnosis codes, codes feed a government model, model emits a score. Each stage is a transformation where information can be lost, exaggerated, or invented. For years, the industry optimised the pipeline for output, meaning higher scores, without instrumenting it for traceability.
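One way to instrument a pipeline like this is to have every stage append a lineage record that hashes what went in and what came out, so a later reader can walk a score back to the note. The sketch below is illustrative only: the stage names, the `LineageStep` type, the placeholder code and score values, and the toy transformations are all assumptions, not any plan's or regulator's actual system.

```python
import hashlib
from dataclasses import dataclass


def digest(text: str) -> str:
    """Short content hash used to link one stage's output to the next stage's input."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]


@dataclass(frozen=True)
class LineageStep:
    stage: str        # hypothetical stage name, e.g. "coding"
    input_hash: str
    output_hash: str


def run_stage(stage, transform, value, lineage):
    """Apply one transformation and record what went in and what came out."""
    result = transform(value)
    lineage.append(LineageStep(stage, digest(value), digest(result)))
    return result


def chain_is_intact(lineage):
    """True when each step's input is exactly the previous step's output."""
    return all(a.output_hash == b.input_hash for a, b in zip(lineage, lineage[1:]))


lineage = []
note = "Visit note: patient reports a past stroke, no current deficits."
# Toy stand-ins for the human coder and the scoring model (placeholder values).
codes = run_stage("coding", lambda n: "HYPOTHETICAL_CODE_HISTORY_OF_STROKE", note, lineage)
score = run_stage("scoring", lambda c: "placeholder-score", codes, lineage)

print(chain_is_intact(lineage))
```

If any intermediate value is altered after the fact, its hash no longer matches the next step's recorded input, and the break in the chain becomes visible without interviewing anyone.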

The reckoning arrived on schedule. Federal auditors published reviews in March 2026 showing that at three audited plans, 81 to 91 percent of sampled high-risk diagnosis codes could not be traced to adequate supporting records. The most common failure was a provenance failure in miniature: a condition a patient once had, coded as if currently active. The data said “is”. The source said “was”. Nobody had checked the lineage.

The financial consequences are no longer theoretical. One major insurer settled with the US Department of Justice for 117.7 million dollars over review programmes that added codes without validating or removing existing ones. Audit findings are extrapolated from samples to entire contracts, so a small traced error becomes a large recovered payment.
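The arithmetic of extrapolation is worth seeing once. The figures below are invented for illustration, and the simple proportional scaling is an assumption; real audits use formal sampling and estimation methods that this sketch does not attempt to reproduce.

```python
def extrapolate_overpayment(sample_overpaid, sample_paid, contract_paid):
    """Scale the overpayment rate found in a sample up to the whole contract."""
    error_rate = sample_overpaid / sample_paid
    return error_rate * contract_paid


# Hypothetical figures, not taken from any audit.
sample_paid = 2_000_000        # payments covered by the sampled records
sample_overpaid = 150_000      # portion of those found unsupported
contract_paid = 800_000_000    # total payments under the contract

estimate = extrapolate_overpayment(sample_overpaid, sample_paid, contract_paid)
print(f"{estimate:,.0f}")
```

With these made-up inputs, a 150,000 dollar finding in the sample scales to roughly 60 million dollars across the contract, which is why a 7.5 percent error rate in a small sample is not a small problem.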

What defensible data actually looks like

Out of this mess, a useful engineering standard is emerging. The industry calls it defensibility, and it is provenance made operational. A diagnosis code is defensible when four links in its chain are intact and inspectable.

It is encounter-linked: tied to a real, dated interaction between a clinician and the patient, not to a retrospective trawl through old files. It is evidence-based: the clinical note contains documentation that actually supports the condition under explicit criteria. It is explainable: whatever system, human or AI, proposed the code can show the reasoning path from note to code. And it is auditable: the whole chain is stored so that a third party can reconstruct it years later without interviewing anyone.
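The four links translate naturally into a record with four fields and a check that reports which ones are missing. This is a minimal sketch under stated assumptions: the `CodeRecord` type and its field names are hypothetical, and the check only tests that each link is present. Whether the evidence actually meets clinical criteria is a judgment this code cannot make.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class CodeRecord:
    code: str
    encounter_date: Optional[date]      # encounter-linked: a real, dated interaction
    supporting_excerpt: Optional[str]   # evidence-based: text from the clinical note
    reasoning: Optional[str]            # explainable: path from note to code
    archive_ref: Optional[str]          # auditable: pointer to the stored chain


def missing_links(record: CodeRecord) -> list[str]:
    """Return the names of the defensibility links that are absent from the record."""
    checks = {
        "encounter-linked": record.encounter_date is not None,
        "evidence-based": bool(record.supporting_excerpt),
        "explainable": bool(record.reasoning),
        "auditable": bool(record.archive_ref),
    }
    return [name for name, ok in checks.items() if not ok]


# Hypothetical record that was found in old files with no dated encounter.
record = CodeRecord(
    code="HYPOTHETICAL_CODE",
    encounter_date=None,
    supporting_excerpt="History of stroke noted.",
    reasoning="Excerpt matched criterion X (illustrative).",
    archive_ref="archive/records/123",
)
print(missing_links(record))
```

A record is a candidate for defensibility only when `missing_links` returns an empty list; anything else names the link an auditor would find broken.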

Health plans are now rebuilding their pipelines around those four properties, and practical guidance on building defensible RAF scores has become required reading for teams making the transition. The striking thing for a technical audience is how closely it maps to good data engineering anywhere: immutable source records, transformation logs, lineage metadata, and reproducibility.

The AI twist

Provenance was hard enough when humans did the transformations. AI makes it existential.

Modern language models can read a decade of clinical notes in seconds and propose diagnosis codes with impressive recall. But a model that outputs a code without a traceable justification produces exactly the liability regulators are hunting: conclusions with no lineage. In audit terms, an unexplainable AI is a machine for generating indefensible data at scale.

That is why the healthcare AI vendors gaining ground are the ones whose systems emit evidence with every inference: the sentence in the note, the rule it satisfies, the confidence, and the checkpoint where a human confirmed it. The model is free to be sophisticated. The output must be boring, inspectable, and reproducible. American regulators have blessed this division of labour explicitly, describing AI as a support tool whose findings humans must validate, with the government itself using some two thousand certified coders to make final determinations in its expanded audits.
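In code, "evidence with every inference" can be as plain as a record that carries those four items, plus a gate that refuses to release anything whose quoted sentence is not actually in the note or that no human has confirmed. The sketch below is an assumption-laden illustration: the `Inference` type, its field names, and the `is_releasable` rule are hypothetical and do not describe any vendor's product.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Inference:
    code: str
    evidence_sentence: str              # verbatim sentence the model points to
    rule_id: str                        # hypothetical identifier of the rule satisfied
    confidence: float
    confirmed_by: Optional[str] = None  # human reviewer at the checkpoint, if any


def is_releasable(inference: Inference, note: str) -> bool:
    """Release only if the quoted evidence is really in the note and a human confirmed it."""
    sentence = inference.evidence_sentence.strip()
    quoted_in_note = sentence != "" and sentence in note
    return quoted_in_note and inference.confirmed_by is not None


note = "Patient seen today. Diabetes managed with metformin. No new complaints."
proposal = Inference(
    code="HYPOTHETICAL_DIABETES_CODE",
    evidence_sentence="Diabetes managed with metformin.",
    rule_id="rule-illustrative-01",
    confidence=0.93,
)
print(is_releasable(proposal, note))
```

The proposal above is held back until a reviewer is recorded in `confirmed_by`. The exact-match test on the sentence is deliberately strict: a paraphrase of the note is not evidence an auditor can locate.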

The same architecture is about to be demanded everywhere. The EU AI Act requires traceability and human oversight for high-risk systems. Financial regulators expect model decisions to be explainable to auditors. Even advertising platforms face provenance questions about training data. Healthcare is simply the industry paying the bill first and most publicly.

The takeaway for data teams

If you build or buy data systems, the American healthcare story compresses into three habits worth stealing.

Instrument lineage from day one; retrofitting provenance after an incident is archaeology, and archaeology is expensive. Treat removals as seriously as additions; a pipeline that only ever finds errors in the profitable direction is not a quality system, and regulators in every sector have learned to spot the asymmetry. And make explanation a product requirement for any AI in the loop, because “the model said so” is now a failing answer in any room that matters.
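The asymmetry in the second habit is easy to measure once review outcomes are logged with a direction. The following is a rough sketch, assuming a hypothetical change log of `("add" | "remove", code)` pairs and an arbitrary minimum sample size; a one-sided log is a prompt to investigate, not proof of wrongdoing.

```python
from collections import Counter


def direction_counts(changes):
    """Count review outcomes by direction; each change is (action, code)."""
    counts = Counter(action for action, _code in changes)
    return counts["add"], counts["remove"]


def looks_one_sided(changes, min_changes=20):
    """Flag a review programme that has made many changes but never removed a code."""
    adds, removes = direction_counts(changes)
    return adds + removes >= min_changes and removes == 0


# Hypothetical log: 25 additions, no removals.
review_log = [("add", f"CODE_{i}") for i in range(25)]
print(direction_counts(review_log), looks_one_sided(review_log))
```

The `min_changes` threshold is only there to avoid flagging a programme on a handful of reviews; any real cut-off would need to be chosen with the organisation's own volumes in mind.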

Provenance will never trend. But in the era where every important number is machine-assisted, the organisations that can show where their numbers came from will quietly collect the trust, the contracts, and the audit outcomes. Everyone else gets to explain the gaps.