GDPR Data Subject Access Request for AI Training Data
⏱ 12 min · 📊 Advanced · 🎯 Domain II
In 2023, the Italian Data Protection Authority (Garante) temporarily banned ChatGPT over concerns about GDPR compliance, including the lack of a legal basis for processing personal data used in training and the inability to correct inaccurate outputs. This landmark action raised a question every AI company must now answer: what happens when individuals exercise their data rights against AI systems trained on their personal data? This scenario puts you at the center of that challenge.
GDPR data subject rights create unique challenges when personal data has been absorbed into AI model weights.
The Situation
You are the Data Protection Officer at Lingua360, a London-based AI company that develops multilingual translation and content generation tools. Lingua360's models were trained on a dataset of 2.1 billion text samples scraped from the public web, licensed datasets, and customer-contributed data from enterprise clients who opted into a data improvement program.
Today you received a formal GDPR Article 15 Subject Access Request (SAR) from Dr. Elena Vasquez, a Spanish medical researcher. Dr. Vasquez has discovered that several of her published academic papers, blog posts, and social media content were included in Lingua360's training dataset. She is requesting:
1. Confirmation that her personal data was used in training (Article 15(1)(a))
2. A copy of all her personal data in Lingua360's systems (Article 15(3))
3. Information about the purposes of processing and recipients of data (Article 15(1)(a-g))
4. Erasure of her data under Article 17 (right to be forgotten)
5. Correction of inaccurate outputs the model generates about her (Article 16)
Each request raises complex legal and technical questions that do not have settled answers.
Legal Analysis — Navigating Uncharted Territory
Article 15 — Right of access: Lingua360 can confirm that Dr. Vasquez's data was in the training set because the company maintains a URL-level index of training data sources. However, providing a "copy" of her data is complicated — her data was used to adjust model parameters during training and does not exist as a discrete, extractable record within the model itself. The Article 29 Working Party guidance suggests that the right of access applies to personal data in any form, but extracting specific training examples from a neural network remains technically challenging.
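The URL-level index mentioned above can be sketched as a simple lookup. This is a minimal illustration, not Lingua360's actual system; the data structure, URLs, and function name are all hypothetical:

```python
# Hypothetical URL-level index of training data sources, mapping each
# source URL to the categories of personal data it contains.
training_index = {
    "https://example.org/papers/vasquez-2021": {"categories": ["academic publication"]},
    "https://example.org/blog/vasquez-notes": {"categories": ["blog post", "biographical data"]},
    "https://example.org/news/unrelated": {"categories": ["news article"]},
}

def find_subject_sources(index, name_fragment):
    """Return index entries whose source URL mentions the data subject."""
    fragment = name_fragment.lower()
    return {url: meta for url, meta in index.items() if fragment in url.lower()}

matches = find_subject_sources(training_index, "vasquez")
```

An index like this supports the Article 15(1)(a) confirmation step (was the data used, and in what categories), even though it cannot extract the data's influence from the trained model itself.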
Article 17 — Right to erasure: Dr. Vasquez requests deletion of her data. Lingua360 can remove her data from the training dataset to prevent use in future training runs. However, the model has already been trained — her data's influence is embedded in the model weights. Full erasure would require machine unlearning, which is an active research area but not yet reliable at production scale. The alternative is retraining the model from scratch without her data, at a cost of approximately 4.2 million euros.
Article 16 — Right to rectification: Dr. Vasquez reports that the model generates inaccurate biographical information about her. Rectification of AI outputs is particularly challenging because the model generates text probabilistically — there is no "record" to correct. Techniques like RLHF fine-tuning or output filtering could reduce but not eliminate inaccurate outputs.
Legal basis for original processing: Lingua360 relies on Article 6(1)(f) — legitimate interest — as its legal basis for processing publicly available data for AI training. Following the Garante's ChatGPT decision and subsequent guidance, this basis requires a documented legitimate interest assessment (LIA) and a balancing test demonstrating that the company's interests do not override the data subject's rights and freedoms.
Knowledge Check
Dr. Vasquez requests erasure of her personal data under GDPR Article 17. Her data was used to train a neural network, and its influence is now embedded in the model weights. Which of the following most accurately describes Lingua360's obligation?
A. Lingua360 must remove her data from the training dataset and take reasonable steps to prevent its use in future processing, while documenting why full erasure from model weights is technically disproportionate
B. Lingua360 must retrain the model from scratch to fully erase her data
C. Lingua360 has no obligation because the data was publicly available
D. Lingua360 can refuse the request because AI training data is exempt from GDPR

Correct answer: A.
Under GDPR, erasure must be implemented to the extent technically feasible. Removing data from the training set and preventing future use is a reasonable step. Full retraining may be deemed disproportionate, but this must be documented with a legitimate justification. There is no blanket exemption for AI training data or publicly available data under GDPR.
The Response Strategy
As DPO, you develop a comprehensive response within the 30-day GDPR deadline:
Immediate response (within 30 days):
- Confirm that Dr. Vasquez's data was included in the training dataset, providing the specific URLs and data categories
- Provide information about processing purposes, legal basis (legitimate interest), data retention periods, and data recipients
- Remove her data from the active training dataset and all data pipelines
- Document the technical limitations of extracting or erasing data from trained model weights, with supporting evidence from technical staff
- Implement output filtering to reduce generation of inaccurate information about Dr. Vasquez
- If the full request cannot be completed within 30 days, notify Dr. Vasquez of a 60-day extension under Article 12(3) with reasons
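The output-filtering step above can be sketched as a post-generation check. This is a deliberately simplified illustration using string matching; a production system would rely on named-entity recognition and broader alias handling. The blocklist contents and function name are hypothetical:

```python
import re

# Hypothetical blocklist of names for a data subject who has exercised
# GDPR rights; a real deployment would also cover aliases and misspellings.
BLOCKLIST = ["Elena Vasquez", "Dr. Vasquez"]

def filter_output(text, blocklist=BLOCKLIST):
    """Withhold model outputs that mention a blocklisted data subject."""
    pattern = "|".join(re.escape(name) for name in blocklist)
    if re.search(pattern, text, flags=re.IGNORECASE):
        return "[response withheld: references a data subject with an active GDPR request]"
    return text
```

Filtering of this kind reduces, but does not eliminate, the risk of inaccurate outputs, which is why it complements rather than replaces the documented proportionality analysis.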
Systemic improvements:
- Build a data subject search tool that allows individuals to check if their data is in the training set
- Implement a web crawler opt-out mechanism (robots.txt compliance and a dedicated exclusion registry)
- Document the legitimate interest assessment for web scraping, including the balancing test
- Establish a machine unlearning research initiative to develop viable erasure capabilities
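The robots.txt compliance step above can be implemented with Python's standard-library `urllib.robotparser`. The sketch below assumes a hypothetical crawler user agent, "Lingua360Bot", and an example robots.txt; a full solution would also consult the dedicated exclusion registry:

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt a site might publish to opt out of AI training crawls.
robots_txt = """\
User-agent: Lingua360Bot
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Check permission before fetching each URL.
allowed = rp.can_fetch("Lingua360Bot", "https://example.org/blog/post")
blocked = rp.can_fetch("Lingua360Bot", "https://example.org/private/page")
```

Honoring robots.txt strengthens the legitimate interest balancing test: it is documented evidence that the controller gives data subjects and publishers a practical way to object to collection.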
Knowledge Check
Lingua360 relies on legitimate interest (Article 6(1)(f)) as its legal basis for processing publicly available personal data for AI training. Which of the following is REQUIRED to support this legal basis?
A. A documented legitimate interest assessment including a balancing test
B. An anonymization certificate from an accredited third party
C. Approval from the relevant supervisory authority before processing begins
D. Consent from each data subject whose data appears in the training set

Correct answer: A.
Legitimate interest under Article 6(1)(f) requires a three-part test: (1) identifying the legitimate interest pursued, (2) demonstrating the processing is necessary to achieve that interest, and (3) balancing the interest against the data subject's rights and freedoms. This must be documented. Consent is a separate legal basis. Prior supervisory authority approval is not required for legitimate interest processing.
Knowledge Check
Under GDPR, Lingua360 must respond to Dr. Vasquez's Subject Access Request within:
A. 90 days for requests involving AI systems
B. 14 days, with no possibility of extension
C. 60 days, with a possible 30-day extension
D. 30 days, with a possible 60-day extension for complex requests

Correct answer: D.
Article 12(3) of the GDPR requires controllers to respond to data subject requests within one month (approximately 30 days). This can be extended by two further months (60 days) where necessary, taking into account the complexity and number of requests. The controller must inform the data subject of the extension within the initial one-month period. There is no special timeline for AI-related requests.
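Because Article 12(3) counts in calendar months rather than fixed days, deadline tracking needs month arithmetic. The sketch below uses one reasonable convention, clamping to the last day of the target month when the start day does not exist (e.g. 31 January plus one month); check local regulatory practice before relying on any particular convention:

```python
from datetime import date
import calendar

def add_months(d, months):
    """Add calendar months, clamping to the last day of the target month."""
    month_index = d.month - 1 + months
    year = d.year + month_index // 12
    month = month_index % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)

received = date(2024, 1, 31)
initial_deadline = add_months(received, 1)   # one month under Article 12(3)
extended_deadline = add_months(received, 3)  # plus two further months if extended
```

For a request received on 31 January 2024, this convention yields an initial deadline of 29 February 2024 and an extended deadline of 30 April 2024.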
🤖 ChatGPT
A data subject has requested erasure of their personal data from our AI training dataset under GDPR Article 17. We can remove them from the dataset, but the model has already been trained and their data influenced the weights. Retraining costs 4.2 million euros. How do we handle this legally?
This is one of the hardest unsolved questions in AI governance, but there is a defensible path forward:
**Step 1: Fulfill what is technically feasible**
- Remove the data subject's data from the training dataset immediately
- Ensure it is excluded from all future training runs and data pipelines
- Document these actions thoroughly
**Step 2: Address the model weights issue**
- Document why full erasure from trained model weights is technically infeasible or disproportionate
- Reference Recital 65 of GDPR, which acknowledges that erasure may be limited by technical constraints
- The key legal concept is "reasonable steps" under Article 17(2) — you must take reasonable steps considering available technology and cost of implementation
**Step 3: Build your proportionality argument**
- EUR 4.2M retraining cost vs. the marginal privacy impact of one person's data diffused across billions of parameters
- The data subject's information is not discretely stored or retrievable from the model
- Include technical evidence that the model cannot reproduce the specific training data
**Step 4: Mitigate ongoing impact**
- Implement output filtering to prevent generation of personal information about this individual
- Monitor model outputs for any content related to the data subject
**Step 5: Prepare for regulatory scrutiny**
- Document everything in your ROPA (Record of Processing Activities)
- The EDPB has not issued definitive guidance on machine unlearning obligations
- Be prepared to demonstrate good faith compliance efforts if a supervisory authority inquires
**Important caveat:** This area is evolving rapidly. Monitor guidance from the EDPB, the ICO (UK), and national DPAs. The Italian Garante's OpenAI decision suggested that technical difficulty alone may not be sufficient — you need to show proactive mitigation.
📋 Scenario 3 Complete
"GDPR data subject rights apply to AI training data, but fulfillment is complicated by the technical reality that personal data becomes diffused across model weights during training. A defensible approach combines feasible compliance steps with documented proportionality arguments — but this area of law is rapidly evolving."
Tomorrow — Day 4
Building an AI Risk Framework with NIST AI RMF
A financial services company implements the NIST AI Risk Management Framework from scratch across 15 AI systems, navigating organizational resistance and technical complexity.