1 May 2026
The Data Deals Behind Modern AI, and the Audit Trail Nobody Has

Across 2023 and early 2024, a flurry of deals quietly reshaped the economics of artificial intelligence. The Associated Press. Axel Springer. Shutterstock. Reddit. One by one, content owners struck agreements with AI companies, licensing years of archives, articles, and imagery for use in training AI models. The sums were significant. The deals were historic.
What none of them produced was a standard audit trail.
Ask the lawyers who negotiated those agreements what compliance documentation the AI company can hand to a regulator today: a structured record of what was accessed, under which legal basis, with what purpose restrictions, and what happens if a data subject requests erasure. The answer tends to be a polite variation of: it depends on what we agreed to in the contract.
That answer is about to stop being acceptable.
The Lawsuit That Changed the Conversation
In December 2023, The New York Times filed suit against OpenAI and Microsoft, alleging that millions of articles had been used to train AI models without authorisation. Those proceedings became a landmark in the broader legal debate over AI training data. But the lawsuit did something more immediate than any precedent it might eventually set: it made institutional data holders around the world ask the same question at the same time.
If we were to license our data to an AI company, what guarantees would we actually have?
The honest answer, at that moment, was: very few. Deals were being negotiated one contract at a time, by legal teams producing bespoke agreements with no standard structure, no shared audit framework, and no technical mechanism to verify that the terms were being honoured after the ink dried.
This was not the AI companies' fault alone. Nor the publishers'. The infrastructure to do it properly simply did not exist.
What Regulators Are Now Asking
The European Union's AI Act entered into force in August 2024. Article 53, covering providers of general-purpose AI models, includes specific obligations around training data: technical documentation of the model and how it was trained, a policy for complying with EU copyright law (including rights holders' text-and-data-mining opt-outs), and a public summary of the content used for training. These obligations came into application in August 2025. The Commission's enforcement powers over these providers (fines, model access requests, recalls) arrive in August 2026.
The European Data Protection Board weighed in separately, in a 2024 opinion on AI and data protection, making clear that GDPR consent is a fragile legal basis for AI training precisely because training is generalised, downstream, and practically irreversible. The right to erasure, a cornerstone of GDPR, creates an obligation that AI companies currently have no reliable technical mechanism to honour once data has been incorporated into a model.
Taken together, regulators are asking a deceptively simple question:
Can you show us exactly where your training data came from, under what legal terms, and what happens when a data subject asks to be removed?
For most AI companies operating at scale today, a complete, technically verifiable answer to that question does not exist.
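To make the regulators' question concrete, here is a minimal sketch in Python of what a structured answer for a single licensed dataset might contain. The `ProvenanceRecord` type and its field names are hypothetical, chosen for illustration; they are not drawn from the AI Act, the GDPR, or any existing standard.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class ProvenanceRecord:
    """Hypothetical record for one licensed dataset; field names are illustrative."""
    dataset_id: str                  # what was accessed
    licensor: str                    # where it came from
    legal_basis: str                 # under what legal terms
    permitted_purposes: tuple = ()   # purpose restrictions agreed with the licensor
    erasure_procedure: str = ""      # what happens when a data subject asks to be removed

    def missing_answers(self) -> list:
        """Return the names of fields that were left empty."""
        return [name for name, value in asdict(self).items() if not value]


# Example values are invented for this sketch.
record = ProvenanceRecord(
    dataset_id="example-archive-2010-2020",
    licensor="Example Publisher",
    legal_basis="licence agreement",
    permitted_purposes=("model training",),
)

print(json.dumps(asdict(record), indent=2))
print(record.missing_answers())  # ['erasure_procedure']
```

In this example the erasure field is the one left blank, which mirrors where the hardest open problem sits. A record like this only states what was agreed; it does nothing on its own to prove the terms were honoured.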
A Simple Question. No Good Answer.
The deals that have been signed are real. The data flowing into AI models is real. The payments are real. What is systematically absent is the layer in between: the compliance infrastructure that would make those transactions auditable, purpose-bound, and revocable.
In practice, this means:
- An AI lab asked to demonstrate training data provenance to an EU national authority typically has, at best, a folder of PDF contracts. No cryptographically verifiable ledger. No purpose-bound access log. No revocation signal.
- A publisher that has licensed its archive to three different AI companies for three different purposes has no real-time view of which access keys are active, which models are hitting which records, and whether the terms are being respected.
- A data subject who exercises their right to erasure triggers, in theory, an obligation to remove their data from training pipelines. In practice, there is currently no mechanism to honour this without a full model retrain.
These are not hypothetical edge cases. They are live legal exposures, growing every month, and the underlying obligations have applied since August 2025.
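As a rough illustration of the three missing pieces named above (a verifiable ledger, a purpose-bound access log, a revocation signal), the following Python sketch chains log entries together with SHA-256 hashes so that later tampering is detectable. Everything here is an assumption made for illustration: the `AccessLedger` class, its method names, and the key and purpose strings are hypothetical. A hash chain alone is only tamper-evident; a production system would also need signatures and an independent party holding the chain head. It also does not address removing data from a model that has already been trained.

```python
import hashlib
import json
from dataclasses import dataclass, field

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(body: dict) -> str:
    """SHA-256 over a canonical JSON encoding of an entry body."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


@dataclass
class AccessLedger:
    """Hypothetical append-only log; each entry commits to the one before it."""
    grants: dict = field(default_factory=dict)   # key_id -> permitted purpose
    revoked: set = field(default_factory=set)    # key_ids withdrawn by the licensor
    entries: list = field(default_factory=list)

    def _append(self, event: str, key_id: str, purpose: str) -> None:
        prev = self.entries[-1]["hash"] if self.entries else GENESIS
        body = {"event": event, "key_id": key_id, "purpose": purpose, "prev": prev}
        self.entries.append({**body, "hash": _digest(body)})

    def grant(self, key_id: str, purpose: str) -> None:
        self.grants[key_id] = purpose
        self._append("grant", key_id, purpose)

    def revoke(self, key_id: str) -> None:
        self.revoked.add(key_id)
        self._append("revoke", key_id, self.grants.get(key_id, ""))

    def access(self, key_id: str, purpose: str) -> bool:
        """Allow access only for an unrevoked key used for its granted purpose."""
        allowed = key_id not in self.revoked and self.grants.get(key_id) == purpose
        self._append("access" if allowed else "denied", key_id, purpose)
        return allowed

    def verify(self) -> bool:
        """Recompute every hash and check that each entry links to the previous one."""
        prev = GENESIS
        for entry in self.entries:
            body = {k: v for k, v in entry.items() if k != "hash"}
            if body["prev"] != prev or _digest(body) != entry["hash"]:
                return False
            prev = entry["hash"]
        return True


ledger = AccessLedger()
ledger.grant("key-1", "medical-research")
on_purpose = ledger.access("key-1", "medical-research")    # True
off_purpose = ledger.access("key-1", "ad-targeting")       # False: wrong purpose
ledger.revoke("key-1")
after_revoke = ledger.access("key-1", "medical-research")  # False: key revoked
intact = ledger.verify()                                   # True
ledger.entries[1]["purpose"] = "ad-targeting"              # simulate tampering
tampered_ok = ledger.verify()                              # False
```

Denied attempts are logged as well as successful ones, on the assumption that a licensor or regulator would want to see attempted off-purpose use, not just permitted access.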
The Supply Side Is Not Convinced
Across Europe, institutional data holders (publishers, hospital consortia, banking archives, national libraries) sit on datasets that AI companies would pay significant sums to access. Many have been approached. The majority have declined.
The reason is not price. It is accountability.
A hospital does not refuse to license anonymised patient outcome data because it wants a larger cheque. It refuses because it cannot verify that "anonymised" will remain anonymised once the data has been processed by an external model. Because it cannot restrict the data to a specific medical research use case and enforce that restriction technically. Because it cannot prove to its patients, or to a regulator, that it retained meaningful control throughout the transaction.
A publisher does not refuse to license its archive because it has a better offer elsewhere. It refuses because its editorial board has objected, because it watched a competitor get sued, and because the contract it was offered gave it no real-time visibility into downstream use after signing.
The supply constraint in the AI training data market is not a demand problem or a pricing problem. It is an infrastructure problem. The rails that would make these transactions safe, traceable, and legally defensible across both sides of the table simply do not yet exist in a standardised, scalable form.
A Pattern We Have Seen Before
This situation has a precedent, and it is worth naming.
The EU's revised Payment Services Directive (PSD2) created a regulatory forcing function in banking: financial institutions were required to share customer account data, with the customer's consent, with licensed third-party providers through open APIs. The demand for data sharing was clear. The need was clear. But there was no standard technical layer to do it safely, consistently, or in a way that regulators could inspect.
Companies like TrueLayer in Europe, and Plaid in the United States, where the same need arose without a PSD2-style mandate, were built to solve exactly that problem: not to be banks, not to hold money, but to be the authenticated, auditable rail in the middle. The layer that made the transaction happen in a way that all parties, and regulators, could trust.
The same dynamic is forming in AI training data now. A regulatory forcing function exists and is already in application. The demand is real and growing fast. The deals are happening. What is absent is the standard infrastructure layer. We will trace this analogy in full in Part Four of this series.
Why This Series
At Gyld, we have been thinking about this problem for three years, not from the outside as commentators, but as engineers and builders working inside the tightest regulatory environment in the world, designing compliance infrastructure from the ground up.
Over the next six posts, we will walk through the full architecture of this problem: the regulatory requirements in detail, the supply-side constraints, the comparable infrastructure plays that show how this resolves, what the technical solution needs to look like, and what we have been building.
By the end of the series, we will have something to announce.
Start with what is already true: the deals have been signed, the regulations are live, and the audit trail does not exist. That gap is not going to stay empty.