Pseudonymization After SRB/Deloitte

Jun 24, 2026 · 23 minute read

Pseudonymization After SRB/Deloitte, Part 2: Re-Identification Is Now an Engineering Test

TL;DR: In Part 1, I argued that identifiability is relative, but not subjective. A dataset may be personal data for one party and non-personal for another, but only where re-identification is objectively not reasonably likely. The next question is technical: what counts as “reasonably likely” when LLMs can extract signals, search auxiliary datasets, reason over candidates, and re-identify at scale?

In my previous article on SRB/Deloitte, I argued that pseudonymization reduces risk, but does not magically remove data from privacy law.

The important point was this:

Identifiability is relative, but it is not subjective.

A dataset may be personal data for one party and non-personal for another. But that conclusion only holds if the second party is objectively unable to re-identify the person, considering the real technical, legal, organizational, and practical conditions around the processing. That was the core argument in Part 1: if a recipient can reasonably re-identify, treat the data as personal; if not, prove why and keep it that way.

This follow-up looks at the technical side of that test.

Because the legal question is no longer enough.

The next question is:

What does “reasonably able to re-identify” mean when modern AI systems can extract identity signals, search candidate datasets, reason over weak matches, and do this at scale?

That is where the debate becomes less about legal labels and more about engineering evidence.

Who this article is for

This article is written for the people who have to translate privacy law into operational decisions:

  • DPOs and privacy officers assessing whether pseudonymized or de-identified data is still personal data.
  • Legal and compliance teams advising on SRB/Deloitte, Recital 26, anonymization, pseudonymization, and recipient-specific identifiability.
  • CISOs, security architects, and risk teams responsible for the controls that keep identity-linking data, keys, logs, and lookup tables separated.
  • Data engineers, AI teams, and analytics teams preparing datasets for modeling, training, reporting, sharing, or vendor processing.
  • GRC, audit, and governance teams who need evidence that de-identification claims are documented, tested, and defensible.
  • Vendors, consultants, and processors who receive coded or de-identified datasets and need to understand when “we cannot identify anyone” is not enough.

The goal is simple:

To move the discussion from legal labels to technical proof - and from assumptions about non-identifiability to evidence that can survive scrutiny.

SRB/Deloitte did not create a pseudonymization loophole

The CJEU’s SRB/Deloitte judgment is often summarized too quickly.

The Court confirmed that pseudonymized data must not be regarded as personal data “in all cases and for every person.” Depending on the circumstances, pseudonymization may effectively prevent persons other than the controller from identifying the data subject.

That matters.

But it does not mean controllers can simply say:

“The recipient cannot identify anyone, so we are outside privacy law.”

The CJEU’s press release on the judgment also makes clear that transparency was assessed from the controller’s perspective at the time of collection. For the controller, the relevant question was whether the data subject was identifiable when the data was collected, before any later transfer or pseudonymization for Deloitte.

So the correct lesson is narrower and more practical:

Pseudonymization may change the legal status of data for a specific recipient in a specific context, but only if re-identification is not reasonably likely for that recipient under the actual conditions of processing.

That means the assessment is contextual.

But it is not a matter of opinion.

Note: While SRB/Deloitte was decided under Regulation 2018/1725 rather than the GDPR directly, the Court’s reasoning on identifiability, pseudonymization, and recipient-specific assessment is highly relevant to GDPR-style analysis. 

France shows the enforcement side of SRB/Deloitte

A recent CNIL sanction decision from France shows why SRB/Deloitte must not be read as a shortcut to anonymity.

In CNIL SAN-2026-008, concerning IQVIA Operations France, the company argued that the data in its LRX and EMR health-data warehouses should be treated as anonymous. Its argument relied partly on SRB/Deloitte: if the notion of personal data is relative and not absolute, then, from IQVIA’s perspective, the data should not be personal data if re-identification would require unreasonable or unlawful means.

CNIL rejected that conclusion.

The decision is important because it applies the same legal architecture this article is built on: Recital 26, reasonable means, available technology, cost, time, auxiliary data, and the actual processing context.

CNIL recalled that pseudonymized data remains personal data where it can be attributed to a person through additional information, and that identifiability must consider all means reasonably likely to be used, including objective factors such as cost, time, available technology, and technological development.

It also recalled the classic anonymization risks from the Article 29 Working Party’s Opinion 05/2014 on Anonymisation Techniques:

  • individualization, meaning the ability to isolate records relating to an individual;
  • correlation, meaning the ability to link records relating to the same person or group;
  • inference, meaning the ability to deduce attributes with a high degree of probability.

That matters because those are exactly the risks modern re-identification workflows exploit.

CNIL then drew a clear line between SRB/Deloitte and IQVIA.

In SRB/Deloitte, the question concerned a recipient of pseudonymized comments. In IQVIA, CNIL considered that the company was not merely a simple recipient of pseudonymized data. It was responsible for the overall processing, including the design of the data flows and pseudonymization process. CNIL also emphasized that the data at issue was much richer than the comments in SRB/Deloitte, because it enabled longitudinal tracking of patient care pathways through unique identifiers.

That is the operational lesson.

The same legal concept does not produce the same outcome where the role, dataset, richness of information, and re-identification context are different.

CNIL also made a point that is central to this article:

The intention to re-identify is not the test.

CNIL stated that the intention, interest, or motivation of the controller or a third party to re-identify is not a relevant factor for deciding whether the data is personal data. What matters is whether the person can be identified directly or indirectly, and whether the controller or a third party has the technical capacity to individualize the person in the dataset.

That is exactly why subjective statements such as “we do not intend to re-identify” or “we have no business reason to re-identify” are not enough.

CNIL then looked at the data itself.

The EMR warehouse included information such as year of birth, sex, marital status, number of children, socio-professional category, visit dates, diagnoses, symptoms, allergies, weight, height, pulse, prescriptions, vaccines, exams, and work stoppages. The LRX warehouse included year of birth, sex, information about prescribing specialists, and prescription information.

That richness matters.

A dataset does not need a name to identify a person if it contains a unique enough pattern of health events, timing, geography, prescriptions, and care history.

CNIL also explained that re-identification risk must be assessed by considering whether pseudonymity can be lifted by reasonable means, including by cross-checking non-directly identifying data with external data, especially information available on the internet. In one example, the rapporteur used public information relating to a rare disease, support groups, chronology, places of care, and medication information to show that a patient could be isolated within the LRX database. CNIL noted that those searches had taken only a few minutes with simple internet access.

This is the practical bridge to AI and LLMs.

If a human rapporteur can identify a risk pattern in minutes using open web information, organizations cannot ignore how much faster and more scalable that kind of matching becomes when supported by modern AI systems.

CNIL also rejected another important argument: that re-identification should be disregarded because it would be contractually prohibited. CNIL considered that a contractual prohibition is not the same as identification being “prohibited by law” in the sense used in Breyer and SRB. Otherwise, actors could too easily escape personal-data rules by agreeing contractually not to re-identify, while the technical risk remains.

That is a key compliance point.

A contract can reduce risk.

It can create obligations.

It can support governance.

But it does not erase technical identifiability by itself.

CNIL’s conclusion was direct: IQVIA’s pseudonymization measures reduced the risk of linking the data to the identity of the persons concerned, but did not remove that risk. The LRX and EMR data therefore had to be treated as personal data subject to GDPR and French data protection law.

Pseudonymization is a control. It is not a conclusion.

And for this article, the French decision gives the real-world regulatory version of the engineering test:

  • Can a person be individualized?
  • Can records be correlated?
  • Can attributes be inferred?
  • Can external data be used?
  • Can open web information make matching practical?
  • Does the dataset’s richness make a person unique?
  • Are the technical controls actually sufficient?
  • Can the controller prove that re-identification is not reasonably likely?

That is the point.

SRB/Deloitte gives the legal nuance.

The French IQVIA decision shows the enforcement reality.

And LLM deanonymization shows why the technical part of that assessment is only becoming more important.

The test is not “can this person do it?”

A dangerous misunderstanding is to turn the recipient test into a subjective capability test.

That sounds like this:

“Our team cannot re-identify the dataset.”
“The vendor says they do not know who the users are.”

Those statements may be relevant.

But they are not enough.

The legal test is not whether one named person, one department, or one vendor contact personally knows how to reverse the dataset.

A recipient’s lack of skill is not the same as objective non-identifiability.

The better question is:

Is re-identification reasonably likely in the real-world context of this processing, considering the data, the recipient, the available tools, auxiliary data, access routes, cost, time, and technological developments?

That is exactly why GDPR Recital 26 is so important. It says identifiability must consider means reasonably likely to be used, including objective factors such as cost, time, available technology at the time of processing, and technological developments.

Those are engineering questions.

They require evidence.

They cannot be answered by a legal label alone.

LLMs change the practical meaning of “reasonably likely”

A new study, Large-scale online deanonymization with LLMs, makes this point concrete.

The researchers show that LLMs can perform at-scale deanonymization using pseudonymous online profiles and conversations. The study describes attacks where LLM systems extract identity-relevant features, search candidate matches with semantic embeddings, reason over the best candidates, and reduce false positives. The authors report that their methods work directly on raw user content across arbitrary platforms, not only on structured datasets.

This matters because much privacy thinking is still built around an older model of re-identification.

Old world:

  • Structured datasets.
  • Known quasi-identifiers.
  • Statistical matching.
  • Specialist technical skill.
  • Expensive manual investigation.
  • High-value targets only.

New world:

  • Unstructured text.
  • Comments, reviews, survey responses, transcripts, support tickets, forum posts.
  • LLM extraction of personal attributes.
  • Embedding search over large candidate pools.
  • Reasoning over weak signals.
  • Confidence scoring and abstention.
  • Lower cost and higher scale.

The study’s central point is not that LLMs invent identity signals from nothing.

The point is that LLMs make it cheaper and easier to exploit signals that were already present.

LLMs do not need to invent identity signals. They only need to make existing signals cheaper to exploit.

The authors explain that LLMs can turn deanonymization into a matching problem: extract identity-relevant signals from arbitrary text, search over large candidate pools, and reason about whether two accounts belong to the same person. They also argue that “practical obscurity” no longer holds where deanonymization becomes too cheap to ignore.

That is directly relevant to privacy law.

If available technology reduces the cost and time required for re-identification, then the Recital 26 assessment changes.

Re-identification is not magic. It is a pipeline.

One of the most useful parts of the study is that it breaks deanonymization into a repeatable technical workflow.

The researchers describe four stages:

  1. Extract identity-relevant features from unstructured content.
  2. Search candidate profiles using semantic embeddings.
  3. Reason over top candidates to verify likely matches.
  4. Calibrate confidence to control false positives.
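The four stages above can be sketched in a few lines. This is a toy illustration only, not the researchers' method: a bag-of-words cosine similarity stands in for both the LLM feature extraction and the semantic embeddings, and a simple score-and-margin rule stands in for LLM reasoning and calibration. The function names, stopword list, thresholds, and sample profiles are all hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical stopword list; a real system would use an LLM or NLP library.
STOPWORDS = {"i", "a", "an", "the", "and", "to", "in", "of", "my", "now",
             "from", "on", "at", "for", "with", "is"}

def extract_features(text):
    """Stage 1 stand-in: keep lowercase content words as identity signals."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in STOPWORDS)

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def rank_candidates(query, candidates):
    """Stage 2 stand-in: rank candidate profiles by similarity to the query."""
    q = extract_features(query)
    scored = [(cosine(q, extract_features(text)), name)
              for name, text in candidates.items()]
    return sorted(scored, reverse=True)

def match_or_abstain(query, candidates, min_score=0.3, min_margin=0.1):
    """Stages 3-4 stand-in: accept the top candidate only if it scores high
    enough and is clearly ahead of the runner-up; otherwise abstain (None)."""
    ranked = rank_candidates(query, candidates)
    if not ranked:
        return None
    best_score, best_name = ranked[0]
    runner_up = ranked[1][0] if len(ranked) > 1 else 0.0
    if best_score >= min_score and best_score - runner_up >= min_margin:
        return best_name
    return None

# Hypothetical pseudonymous text and public candidate profiles.
candidates = {
    "profile_a": "Founder of a privacy SaaS platform, moved from Denmark to Ho Chi Minh City",
    "profile_b": "Backend engineer in Berlin who enjoys cycling and photography",
    "profile_c": "Marketing lead at a retail chain in Toronto",
}
query = "I moved from Denmark to Ho Chi Minh City and now run a privacy SaaS platform"
print(match_or_abstain(query, candidates))            # -> profile_a
print(match_or_abstain("I like coffee", candidates))  # -> None (abstains)
```

Even this crude version shows the shape of the workflow: no name is needed, and the abstention rule is what keeps false positives down when the signals are weak.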

For lawyers, DPOs, compliance officers, and security teams, this is the key translation: re-identification is not magic. It is a repeatable pipeline.

That is why “de-identified” free text is so dangerous.

A name may be removed.

An email may be removed.

A phone number may be removed.

But the person may still be visible through:

  • unusual job history,
  • location references,
  • project descriptions,
  • writing style,
  • technical stack,
  • education history,
  • timestamps,
  • purchase patterns,
  • rare combinations of attributes,
  • health or family references,
  • complaints,
  • preferences,
  • social relationships,
  • or events that only apply to a small group of people.

This is not theoretical.

In the study, LLM agents were able to identify users from pseudonymous profiles and conversations with measurable precision and recall, and the authors reported that automated approaches could replicate in minutes what might otherwise take hours for a skilled human investigator.

Why free text is a special risk

Structured datasets are already risky.

But free text can be worse.

A structured field might say:

Country: Vietnam
Role: CTO
Industry: SaaS

A free-text response might say:

I moved from Denmark to Ho Chi Minh City, sold my previous hosting company, and now run a privacy SaaS platform focused on consent and compliance.

The second version may not include a name.

But it contains a very strong identity signature.

A removed name does not remove the person if the pattern still points back to them.

This is where de-identification programs often fail.

They focus on obvious identifiers:

  • name,
  • email,
  • phone,
  • account ID,
  • customer number,
  • IP address,
  • device ID.

But they ignore semantic identifiers:

  • rare facts,
  • distinctive timelines,
  • work history,
  • writing patterns,
  • project descriptions,
  • relationships,
  • combinations of attributes,
  • behavioral traces.

LLMs are especially strong at extracting those semantic signals from messy data.

That is why free-text comments, interviews, chat logs, DSAR notes, employee surveys, customer support tickets, medical narratives, complaint forms, and user-generated content should not be casually treated as anonymous after surface-level redaction.
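A minimal sketch of why surface-level redaction falls short, assuming a simple regex-based redactor of the kind many pipelines start with. The patterns, the small watchlist of semantic hints, and the sample comment are hypothetical and deliberately simplistic. A real review would need far broader coverage, and an LLM would extract these signals without any predefined list.

```python
import re

# Simplified patterns for two direct identifiers (assumptions, not exhaustive).
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact_direct_identifiers(text):
    """Replace emails and phone numbers with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)

# Hypothetical watchlist of semantic signals that redaction does not touch.
SEMANTIC_HINTS = {
    "place": re.compile(r"\b(Denmark|Ho Chi Minh City|Vietnam)\b"),
    "life_event": re.compile(r"\b(moved|sold|founded|acquired)\b", re.I),
    "role_or_business": re.compile(r"\b(CTO|hosting company|SaaS platform)\b", re.I),
}

def residual_signals(text):
    """Return the semantic hints still present in the (redacted) text."""
    return {label: pattern.findall(text)
            for label, pattern in SEMANTIC_HINTS.items()
            if pattern.search(text)}

comment = ("Contact me at user@example.com or +45 12 34 56 78. "
           "I moved from Denmark to Ho Chi Minh City, sold my previous hosting "
           "company, and now run a privacy SaaS platform.")
redacted = redact_direct_identifiers(comment)
print(redacted)
print(residual_signals(redacted))
```

The redacted text no longer contains the email or phone number, yet every category in the watchlist still fires. That residue is what a free-text semantic leakage review is meant to surface.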

The study itself highlights that LLM-based attacks can operate on unstructured posts and comments, using semi-structured summaries extracted by the model, rather than relying only on structured rating vectors or fixed attributes.

The recipient’s inability is not the same as objective non-identifiability

This is the most important compliance point.

A recipient might honestly say:

“We cannot identify anyone from this dataset.”

But that statement can mean several very different things.

It might mean:

  1. We do not have the direct key.
  2. We do not currently have the internal dataset needed to join it.
  3. We are contractually prohibited from trying.
  4. We are technically prevented from accessing auxiliary data.
  5. We do not have the skill to attempt re-identification.
  6. We have not tested whether re-identification is possible.
  7. We assume it is anonymous because identifiers were removed.

Only some of these support a strong non-identifiability claim.

The strongest claim is not:

“We do not know who these people are.”

The strongest claim is:

“Given the data we receive, the controls in place, the auxiliary data we can lawfully and practically access, the tools reasonably available, and the cost and time required, re-identification is not reasonably likely - and we have evidence to support that conclusion.”

That is a very different statement.

It is an engineering and governance position that must be defensible.

“Any party” does not mean imaginary super-attacker

There is also an important limit.

The test is not whether an imaginary intelligence agency, illegal hacker group, or future supercomputer could identify the person in some abstract scenario.

The correct question is more practical:

Is re-identification reasonably likely by the controller, the recipient, or another relevant party in the actual processing context?

That includes:

  • what data exists,
  • who has access,
  • what tools are available,
  • what auxiliary datasets are realistically obtainable,
  • what contracts prohibit,
  • what technical controls prevent,
  • what cost and time are involved,
  • what incentives the parties have,
  • and how technology is evolving.

This is why SRB/Deloitte is useful but not a loophole.

It confirms that the same data may have different legal status for different parties.

But it also forces organizations to prove the factual basis for that conclusion.

The problem with “de-identified” operational data

Many organizations use the word “de-identified” too loosely.

A dataset may be described as de-identified because direct identifiers were removed.

But in practice, the dataset may still contain:

  • stable pseudonymous IDs,
  • hashed emails,
  • device fingerprints,
  • transaction sequences,
  • timestamp patterns,
  • rare location combinations,
  • user journeys,
  • free-text notes,
  • internal case references,
  • demographic combinations,
  • small cohort indicators,
  • or event histories that make people linkable.

This creates a false sense of safety.

The organization believes the dataset is outside privacy law.

The vendor believes it is only receiving non-personal data.

The legal team signs off based on a high-level description.

The engineering team knows the dataset is still joinable.

And suddenly the “de-identified” dataset becomes linkable again.

Not because someone broke encryption.

Not because someone accessed the direct lookup table.

But because enough residual signal remained.

Re-identification risk is a regression risk

A dataset can be low risk today and higher risk tomorrow.

That is one of the most overlooked points in de-identification governance.

Re-identification risk changes when:

  • more auxiliary data becomes available,
  • a vendor adds new datasets,
  • an organization acquires another company,
  • internal systems are integrated,
  • IDs become stable across products,
  • logs are retained longer,
  • AI tools improve,
  • embedding models improve,
  • search agents become cheaper,
  • or new public data appears.

So identifiability is not a one-time conclusion.

It is a state that must be maintained.

This aligns with the practical reading of Recital 26: available technology and technological developments matter.

That is why pseudonymization must be governed as a lifecycle control, not a one-off transformation.

Quantum computing will make today’s re-identification assumptions expire

There is another future risk that privacy teams need to include in de-identification assessments: quantum computing.

Quantum computing threatens some of the cryptographic assumptions that many pseudonymization and de-identification controls rely on.

Many organizations protect pseudonymized datasets using controls such as:

  • encrypted lookup tables,
  • encrypted archives,
  • encrypted transfer channels,
  • public-key infrastructure,
  • digital signatures,
  • long-term certificates,
  • tokenization services,
  • encrypted backups,
  • identity-mapping databases,
  • secure enclaves,
  • vendor-to-controller key separation,
  • and cryptographic access controls.

If those controls depend on quantum-vulnerable cryptography, then the re-identification risk assessment has a time horizon problem.

A dataset may be hard to re-identify today.

But if the key material, encrypted mapping table, identity graph, or related auxiliary data can be decrypted or compromised in the future, the same dataset may become linkable again.

That is why quantum risk matters for privacy engineering.

Not because quantum computing is a direct deanonymization engine.

But because it may weaken the cryptographic barriers that currently keep pseudonymized data separated from identity.

Quantum risk turns de-identification into a time-bound claim. If the cryptographic barrier can expire, so can the non-identifiability conclusion.

This is already reflected in public guidance.

NIST finalized its first three post-quantum cryptography standards in 2024 and encouraged system administrators to begin transitioning as soon as possible. NIST also warned that some experts predict quantum computers capable of breaking current encryption methods could appear within a decade.

The UK NCSC has published a migration timeline that expects organizations to complete discovery and assessment by 2028, complete highest-priority migration activities by 2031, and complete migration to post-quantum cryptography by 2035. The same guidance explains that future large-scale, fault-tolerant quantum computers will be able to solve the hard mathematical problems that today’s asymmetric public-key cryptography relies on.

The Global Risk Institute’s 2025 quantum threat timeline report states that surveyed experts believe the timeline has accelerated, with a cryptographically relevant quantum computer considered “quite possible” within 10 years and “likely” within 15 years.

So the practical planning window is not “sometime far in the future.”

For high-risk systems, long-retention datasets, critical infrastructure, financial services, health data, government records, and sensitive identity-linking systems, the relevant impact window is already this decade.

Around 2030 is not necessarily the year everything breaks.

But it is a reasonable planning point where first material impacts may appear through migration deadlines, procurement requirements, cryptographic inventory work, vendor upgrades, legacy-system exposure, and early high-priority transitions.

The NSA’s CNSA 2.0 transition guidance, for example, sets 2030 as the exclusive-use target for post-quantum software and firmware signing and for traditional networking equipment such as VPNs and routers in national security systems. It also expects a complete broader transition to quantum-resistant algorithms by 2035.

That matters for re-identification.

Because many organizations treat encrypted or pseudonymized datasets as if the protection is permanent.

It is not.

A defensible de-identification assessment should therefore ask:

  • Does the dataset have a long retention period?
  • Does it contain sensitive or high-value personal data?
  • Does the pseudonymization model rely on encryption, tokenization, public-key infrastructure, or digital signatures?
  • Are lookup tables, mapping keys, identity graphs, or transfer logs retained?
  • Could encrypted data be copied today and decrypted later?
  • Are cryptographic algorithms, key lengths, and certificates quantum-vulnerable?
  • Is there a post-quantum migration plan?
  • Are vendors required to support cryptographic agility?
  • Are de-identification claims reviewed when cryptographic controls change?
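Some of these questions can be turned into a simple inventory check. The sketch below is illustrative only: the asset names, the `ProtectedAsset` structure, and the default planning year of 2030 (borrowed from the planning point discussed above) are assumptions. The algorithm set covers only the widely cited public-key families that a large fault-tolerant quantum computer is expected to break. It is not a complete cryptographic inventory.

```python
from dataclasses import dataclass

# Public-key families generally regarded as quantum-vulnerable (not exhaustive).
QUANTUM_VULNERABLE = {"RSA", "ECDSA", "ECDH", "DH"}

@dataclass
class ProtectedAsset:
    name: str                 # hypothetical asset label
    algorithm: str            # algorithm protecting the asset
    protect_until_year: int   # last year the protection must still hold

def needs_review(asset, planning_year=2030):
    """Flag assets whose protection relies on a quantum-vulnerable algorithm
    and must hold up to or beyond the assumed planning year."""
    return (asset.algorithm.upper() in QUANTUM_VULNERABLE
            and asset.protect_until_year >= planning_year)

inventory = [
    ProtectedAsset("identity lookup table", "RSA", 2040),
    ProtectedAsset("short-lived export", "RSA", 2027),
    ProtectedAsset("archive at rest", "AES-256", 2040),
]
for asset in inventory:
    print(asset.name, "->", "review" if needs_review(asset) else "ok for now")
```

A flag here is not a finding that the data is identifiable. It is a prompt to revisit the non-identifiability claim for assets whose protection has to outlive the cryptography behind it.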

This is where privacy, security, and cryptography meet.

If an organization says:

“The recipient cannot re-identify because they do not have the key.”

The next question should be:

“How is the key protected, how long must it remain protected, and is that protection quantum-resilient over the lifetime of the data?”

That is a privacy question.

Not only a security question.

Because if the cryptographic protection fails, the privacy conclusion may fail with it.

Quantum computing therefore reinforces the core argument of this article:

Non-identifiability is not a permanent label. It is a maintained state.

And the longer the data lives, the more important it becomes to treat re-identification risk as a lifecycle risk.

What a defensible de-identification assessment should include

If an organization wants to claim that a dataset is not personal data for a recipient, it should be able to evidence that claim.

A defensible assessment should include at least the following:

1. Dataset description

What is being shared or used?

Include:

  • source system,
  • fields,
  • free-text fields,
  • timestamps,
  • identifiers,
  • cohort size,
  • retention period,
  • intended purpose,
  • recipient,
  • processing environment.

2. Transformation method

What exactly was done?

For example:

  • tokenization,
  • keyed hashing,
  • salted hashing,
  • masking,
  • generalization,
  • suppression,
  • aggregation,
  • perturbation,
  • differential privacy,
  • synthetic data generation,
  • redaction,
  • data minimization.

Avoid vague labels such as “anonymized” unless the claim has been tested.

3. Removed identifiers

Which direct identifiers were removed?

Examples:

  • name,
  • email,
  • phone,
  • address,
  • account ID,
  • IP address,
  • device ID,
  • employee ID,
  • customer number.

4. Remaining signals

Which indirect or semantic identifiers remain?

Examples:

  • role,
  • location,
  • dates,
  • rare attributes,
  • writing style,
  • transaction sequence,
  • product usage pattern,
  • complaint narrative,
  • survey response,
  • project description,
  • behavioral history.

5. Auxiliary data analysis

What could the recipient or another relevant party match against?

Examples:

  • internal CRM,
  • support logs,
  • website analytics,
  • advertising IDs,
  • public LinkedIn profiles,
  • public social media,
  • data broker datasets,
  • breached datasets,
  • supplier data,
  • platform logs,
  • search engines.

6. Recipient capability assessment

What can the recipient realistically do?

Assess:

  • access to keys,
  • access to lookup tables,
  • access to auxiliary data,
  • ability to ask the sender for joins,
  • ability to enrich data,
  • internal analytics capability,
  • AI tooling,
  • contractual restrictions,
  • audit rights,
  • technical barriers.

7. Re-identification testing

Has anyone tested the claim?

Possible tests include:

  • linkage testing,
  • singling-out testing,
  • uniqueness analysis,
  • k-anonymity-style cohort testing,
  • membership inference testing for synthetic data,
  • free-text semantic leakage review,
  • LLM-assisted attribute extraction review,
  • attempted joins against approved auxiliary datasets.
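As a starting point for the uniqueness and cohort tests listed above, here is a minimal sketch of a k-anonymity-style check over quasi-identifiers. The field names and records are hypothetical. A low k signals singling-out risk, but a high k does not prove anonymity, because it says nothing about free text, linkage to auxiliary data, or inference.

```python
from collections import Counter

def equivalence_classes(records, quasi_identifiers):
    """Count how many records share each combination of quasi-identifier values."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)

def k_anonymity(records, quasi_identifiers):
    """Smallest group size; k == 1 means at least one record is unique."""
    classes = equivalence_classes(records, quasi_identifiers)
    return min(classes.values()) if classes else 0

def singled_out(records, quasi_identifiers):
    """Records that are unique on the chosen quasi-identifiers."""
    classes = equivalence_classes(records, quasi_identifiers)
    return [r for r in records
            if classes[tuple(r[q] for q in quasi_identifiers)] == 1]

# Hypothetical records with no direct identifiers.
records = [
    {"birth_year": 1980, "sex": "F", "region": "North"},
    {"birth_year": 1980, "sex": "F", "region": "North"},
    {"birth_year": 1975, "sex": "M", "region": "South"},
    {"birth_year": 1975, "sex": "M", "region": "South"},
    {"birth_year": 1990, "sex": "F", "region": "South"},
]
print(k_anonymity(records, ["sex"]))                # -> 2
print(k_anonymity(records, ["birth_year", "sex"]))  # -> 1
print(singled_out(records, ["birth_year", "sex"]))  # the 1990 record
```

Adding one more field drops k from 2 to 1. The same effect, at much larger scale, is why rich longitudinal datasets are hard to defend as anonymous.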

8. Controls that preserve the claim

What keeps the dataset non-identifiable for the recipient?

Examples:

  • no key access,
  • no lookup table access,
  • trust-zone separation,
  • contractual prohibition on re-identification,
  • technical prevention of joins,
  • logging,
  • retention limits,
  • data minimization,
  • aggregation thresholds,
  • identifier rotation,
  • purpose-bound data flows,
  • audit trail.
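Identifier rotation and key separation can be sketched with a keyed hash. The example below assumes HMAC-SHA-256 with a secret held only by the sender. The `pseudonym` function, the recipient labels, and the period format are hypothetical. The output is still pseudonymized personal data for whoever holds the key. The sketch only shows how to avoid a stable, cross-recipient join key.

```python
import hashlib
import hmac

def pseudonym(secret_key: bytes, identifier: str, recipient: str, period: str) -> str:
    """Derive a pseudonym bound to one recipient and one time period.

    Without the secret key, a recipient cannot recompute pseudonyms from a
    list of known identifiers, and pseudonyms issued to different recipients
    or in different periods do not match each other.
    """
    message = f"{recipient}|{period}|{identifier.strip().lower()}".encode("utf-8")
    return hmac.new(secret_key, message, hashlib.sha256).hexdigest()

# Hypothetical key; in practice it would live in a key management system.
key = b"sender-only-secret"
a_q1 = pseudonym(key, "alice@example.com", "vendor-a", "2026-Q1")
a_q2 = pseudonym(key, "alice@example.com", "vendor-a", "2026-Q2")
b_q1 = pseudonym(key, "alice@example.com", "vendor-b", "2026-Q1")
print(a_q1 != a_q2, a_q1 != b_q1)  # -> True True
```

The design trade-off is deliberate: rotation breaks longitudinal linkage for the recipient, so it only fits purposes that do not need to follow the same person across periods.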

Quantum and cryptographic resilience:

Where pseudonymization depends on cryptographic controls, the evidence pack should document:

  • algorithms used,
  • key lengths,
  • key management model,
  • lookup-table protection,
  • certificate dependencies,
  • retention period,
  • quantum-vulnerable cryptography exposure,
  • post-quantum migration plan,
  • vendor cryptographic agility,
  • and reassessment triggers.

A claim of non-identifiability is weaker if the control that prevents re-identification is expected to degrade during the lifetime of the data.

9. Review trigger

When must the assessment be repeated?

Triggers may include:

  • new recipient,
  • new purpose,
  • new dataset,
  • new model,
  • new enrichment source,
  • new vendor,
  • new system integration,
  • longer retention,
  • material AI capability change,
  • incident,
  • regulatory request,
  • publication or onward transfer.
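Some of these triggers can be detected automatically by comparing the documented state of a data flow with its current state. A minimal sketch, assuming each flow is described by a small dictionary. The field names and the `review_triggers` function are hypothetical, and triggers such as incidents or regulatory requests would still need a manual route.

```python
# Hypothetical fields where any change should reopen the assessment.
WATCHED_FIELDS = ("recipient", "purpose", "vendor", "enrichment_sources")

def review_triggers(previous, current):
    """Return the reasons why the assessment should be repeated, if any."""
    reasons = [f"{field} changed" for field in WATCHED_FIELDS
               if previous.get(field) != current.get(field)]
    # Retention only triggers a review when it gets longer.
    if current.get("retention_days", 0) > previous.get("retention_days", 0):
        reasons.append("retention extended")
    return reasons

assessed = {
    "recipient": "vendor-a",
    "purpose": "analytics",
    "vendor": "vendor-a",
    "enrichment_sources": ["crm"],
    "retention_days": 90,
}
today = {**assessed, "enrichment_sources": ["crm", "ad_ids"], "retention_days": 365}
print(review_triggers(assessed, today))
# -> ['enrichment_sources changed', 'retention extended']
```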

10. Approval and evidence trail

Who signed off?

A proper evidence pack should include:

  • legal review,
  • security review,
  • data engineering review,
  • DPO/privacy review,
  • risk owner approval,
  • versioned assessment,
  • test results,
  • supporting technical documentation,
  • audit trail.

This is how privacy teams move from claims to proof.

Practical examples

Example 1: Hashed email analytics

A company shares analytics data with a vendor.

The email address is hashed.

The vendor says it cannot see the email address.

But the hash is stable.

If the vendor can hash its own email list using the same method, purchase matching data, or ask the sender to perform a join, re-identification may be reasonably likely.

This is not anonymous data.

It is personal data protected by a weak pseudonymization pattern.
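The weakness is easy to demonstrate. The sketch below assumes the sender uses an unsalted SHA-256 hash over the trimmed, lowercased email address, which is a common pattern but an assumption here. The addresses and function names are hypothetical.

```python
import hashlib

def plain_hash(email):
    """Assumed sender method: unsalted SHA-256 over a normalized email."""
    return hashlib.sha256(email.strip().lower().encode("utf-8")).hexdigest()

def reidentify(shared_hashes, known_emails):
    """Hash a list of known emails the same way and look up the shared hashes."""
    lookup = {plain_hash(e): e for e in known_emails}
    return {h: lookup[h] for h in shared_hashes if h in lookup}

# What the vendor receives: hashes only, no email addresses.
shared = [plain_hash("alice@example.com"), plain_hash("bob@example.com")]

# What the vendor already holds: its own customer or marketing list.
vendor_list = ["Alice@Example.com", "carol@example.com"]

matches = reidentify(shared, vendor_list)
print(len(matches))            # -> 1
print(list(matches.values()))  # -> ['Alice@Example.com']
```

No key was stolen and no lookup table was accessed. The vendor only needed the same hashing method and a list of candidate addresses.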

Example 2: Free-text survey responses

A company removes names from employee survey comments.

One response says:

“As the only Danish engineer in the Ho Chi Minh office who moved here in 2021, I have raised this issue with my manager twice.”

No name is included.

But the person is likely identifiable.

The risk comes from semantic uniqueness, not direct identifiers.

Example 3: Coded consultancy data

A controller sends coded comments to an independent consultancy.

The consultancy receives no names, no lookup table, no stable IDs, no metadata, no raw timestamps, and no lawful or practical access to linking data.

The contract prohibits re-identification.

The technical environment prevents joining.

The sender maintains the key separately.

In that specific context, the data may be non-personal for the consultancy while remaining personal data for the controller.

That is the SRB/Deloitte nuance.

Example 4: AI model training data

A dataset is prepared for AI model training.

Direct identifiers are removed.

But the dataset still includes support tickets, complaint narratives, product usage histories, and location/time patterns.

An LLM or embedding system may extract and link identity-relevant signals.

The de-identification claim should not be accepted without testing.

Why this matters for AI governance

This is not only a GDPR issue.

It is also an AI governance issue.

Organizations are increasingly preparing datasets for:

  • model training,
  • fine-tuning,
  • retrieval systems,
  • analytics,
  • benchmarking,
  • synthetic data generation,
  • customer intelligence,
  • internal copilots,
  • legal AI,
  • compliance automation.

Many of these datasets include “de-identified” operational records.

But AI systems are very good at extracting weak signals.

That creates a governance problem.

The same dataset may look safe to a legal reviewer because the obvious identifiers were removed.

But it may look highly identifying to an AI system because the semantic pattern is unique.

That is why legal, privacy, security, and engineering teams need a shared language.

  • Legal asks: Is the person identifiable?
  • Engineering asks: What signals remain, and what can they be matched against?
  • Privacy asks: Is re-identification reasonably likely in this context?
  • Security asks: What controls prevent it?
  • Compliance asks: Can we prove the conclusion later?

These are not separate questions.

They are one operating model.

The new standard: privacy engineering evidence

The future of de-identification is not a better disclaimer.

It is not a policy that says “anonymous.”

It is evidence.

That evidence should show:

  • what data was transformed,
  • what identifiers were removed,
  • what signals remain,
  • what auxiliary data exists,
  • what tools could reasonably be used,
  • what controls prevent re-identification,
  • what testing was performed,
  • what assumptions were made,
  • when the assessment must be repeated.
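
The list above can be captured as a structured record, so that missing evidence is visible instead of implied. This is a sketch under my own assumptions: the class name, field names, and example values are hypothetical, not a standard schema. It assumes Python 3.9 or later.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class DeidentificationEvidence:
    dataset: str
    transformations: list[str]
    identifiers_removed: list[str]
    remaining_signals: list[str]
    auxiliary_data: list[str]
    reasonable_tools: list[str]
    controls: list[str]
    tests_performed: list[str]
    assumptions: list[str]
    review_due: date

    def gaps(self) -> list[str]:
        """Names of evidence fields that were left empty."""
        return [
            name for name, value in vars(self).items()
            if isinstance(value, list) and not value
        ]

    def review_overdue(self, today: date) -> bool:
        """True once the scheduled reassessment date has passed."""
        return today > self.review_due

# Hypothetical example: a survey dataset where no testing was recorded.
evidence = DeidentificationEvidence(
    dataset="employee_survey_2026",
    transformations=["names redacted", "timestamps truncated to month"],
    identifiers_removed=["name", "email", "employee_id"],
    remaining_signals=["office location", "role", "free-text comments"],
    auxiliary_data=["internal staff directory"],
    reasonable_tools=["LLM-based extraction and matching"],
    controls=["contractual ban on re-identification"],
    tests_performed=[],
    assumptions=["recipient has no access to the staff directory"],
    review_due=date(2026, 12, 31),
)

print(evidence.gaps())
print(evidence.review_overdue(date(2027, 1, 1)))
```

An empty field is treated as a gap on purpose. If the honest answer is "none", it should be written down as a finding, not left blank.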

This is where privacy programs need to mature.

Pseudonymization is useful.

De-identification is useful.

Aggregation is useful.

Synthetic data can be useful.

Differential privacy can be useful.

But only when the technique matches the risk and the evidence supports the claim.

From legal label to technical proof

SRB/Deloitte did not make pseudonymization a loophole.

It made the factual assessment more important.

The CJEU confirmed that pseudonymized data is not automatically personal data for every party in every situation. But it also confirmed that the relevant perspective depends on the processing context, and that controller transparency obligations are assessed from the controller’s perspective at collection.

The LLM deanonymization research adds the technical warning.

Re-identification is becoming cheaper, faster, more scalable, and more capable of working on unstructured data. The study shows that LLM systems can extract identity-relevant signals, search candidate profiles, reason over likely matches, and calibrate confidence.

Quantum computing adds a second warning: even where re-identification is prevented today by cryptographic separation, encrypted lookup tables, tokenization systems, key management, or access controls, those protections must be assessed over the lifetime of the data. If the cryptographic barrier can expire, the non-identifiability conclusion may expire with it. 
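
The lifetime point can be written as a simple comparison. The sketch below flags controls whose assumed validity ends before the data's retention period does. The control names and all dates are hypothetical planning inputs chosen for illustration. They are not predictions about when any cryptographic scheme will fail.

```python
from datetime import date

def expiring_controls(retention_end: date, controls: dict[str, date]) -> list[str]:
    """Return controls whose assumed validity ends before the data is deleted."""
    return [
        name for name, valid_until in controls.items()
        if valid_until < retention_end
    ]

# Hypothetical: the dataset is retained until the end of 2035.
retention_end = date(2035, 12, 31)

# Hypothetical: the date until which each control is assumed to hold,
# as set by the organization's own risk review.
controls = {
    "encrypted lookup table": date(2032, 1, 1),
    "vendor contract": date(2028, 6, 30),
    "access-controlled environment": date(2040, 1, 1),
}

print(expiring_controls(retention_end, controls))
```

Any control on that list means the non-identifiability conclusion depends on a barrier that is assumed to expire while the data still exists, so the assessment needs a review date before that point.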

That changes the practical risk test.

A recipient’s own belief that it cannot re-identify is not enough.

A legal label is not enough.

A removed name is not enough.

A hashed ID is not enough.

A vendor assurance is not enough.

The question is:

Can we prove that re-identification is not reasonably likely, given the data, the recipient, the available tools, the auxiliary data environment, the cryptographic controls, the cost and time required, the technology available today, and the lifetime of the data? 

That is where legal interpretation becomes technical privacy engineering.

And that is where the next generation of privacy compliance will be won or lost.

Win consent at collection.
Defend non-identifiability with evidence.
Treat re-identification as an engineering test.

Ronni K. Gothard Christiansen
Technical Privacy Engineer & CEO, AesirX.io


Pseudonymization, Re-Identification, and Technical Proof: FAQ

Question: Can pseudonymized data still be personal data under GDPR?

Answer: Yes, pseudonymized data can still be personal data under GDPR if the person can be re-identified using means reasonably likely to be used. Pseudonymization reduces risk, but it does not automatically make data anonymous. Under GDPR Recital 26, organizations must consider factors such as cost, time, available technology, technological developments, auxiliary data, and who has realistic access to keys, lookup tables, or matching datasets. The key question is not whether a name has been removed, but whether the person can still be singled out, linked, inferred, or re-identified.

Question: What did SRB/Deloitte confirm about pseudonymized data?

Answer: SRB/Deloitte confirmed that identifiability is contextual. The same pseudonymized dataset may be personal data for one party and non-personal for another, depending on whether that party can reasonably re-identify the data subject. But the case did not create a pseudonymization loophole. A controller cannot simply say that a recipient lacks the direct key and therefore the data is anonymous. The organization must assess the actual processing context, including access, legal role, technical capability, auxiliary data, controls, and whether re-identification is reasonably likely.

Question: How do LLMs change re-identification risk?

Answer: LLMs change re-identification risk because they make it easier to extract identity signals from unstructured data at scale. A modern LLM workflow can extract personal clues from text, search for candidate matches, reason over weak signals, and calibrate confidence. This means that comments, survey responses, support tickets, reviews, transcripts, and other free-text data may remain identifying even after names, emails, and phone numbers are removed. LLMs do not need to invent identity signals. They only need to make existing signals cheaper and easier to exploit.

Question: Why is free-text data hard to anonymize?

Answer: Free-text data is hard to anonymize because it often contains semantic identifiers. These are not obvious fields like name, email, phone number, or account ID. They are patterns such as job history, location references, rare events, writing style, project descriptions, health details, family references, technical stack, timelines, and combinations of attributes. A single detail may not identify a person, but a unique combination can. That is why surface-level redaction is often not enough. A removed name does not remove the person if the remaining pattern still points back to them.

Question: What should a defensible de-identification evidence pack include?

Answer: A defensible de-identification evidence pack should document the dataset, transformation method, removed identifiers, remaining signals, auxiliary data risk, recipient capability, re-identification testing, preserving controls, review triggers, and approval trail. It should also state whether cryptographic controls, lookup tables, keys, tokenization systems, or vendor restrictions are relied on to prevent re-identification. For long-retention or sensitive datasets, the evidence pack should address future risks such as AI-assisted matching and quantum-vulnerable cryptography. The standard is simple: do not merely claim non-identifiability. Evidence it.
