Quick Answer
AI systems that use personal data (for training, fine-tuning, scoring, or making decisions) are regulated by GDPR, by HIPAA where the data includes PHI, and, for high-risk systems, by the EU AI Act from August 2, 2026. GDPR requires a lawful basis for using personal data in AI training and a DPIA for high-risk AI. HIPAA requires a Business Associate Agreement before any PHI reaches an AI vendor's servers. The EU AI Act imposes data governance requirements on training datasets for high-risk systems, with full enforcement of Annex III requirements starting August 2, 2026. A single CSV file used as AI training data can carry all three sets of obligations at once.
If you ignore this:
- Training on EU customer data without a documented lawful basis → GDPR violation, enforceable now
- Uploading PHI to an ML platform without a signed BAA → HIPAA violation, enforceable now
- Deploying a hiring, credit, or healthcare AI without Annex III data governance documentation → EU AI Act high-risk breach, enforceable August 2, 2026
These are not hypothetical risks. They are the exact scenarios regulators are investigating in 2026.
Fast Fix (2 Minutes)
If you're preparing a CSV dataset for AI model training or fine-tuning right now:
- Does the CSV contain personal data? If yes, GDPR applies. Lawful basis required. DPIA may be required if the AI makes automated decisions about individuals.
- Does the CSV contain PHI? If yes, HIPAA applies. The AI vendor's servers receiving PHI = Business Associate. BAA required before any file transfer.
- Will the AI system make decisions about employment, credit, healthcare, or education? If yes, it may be a high-risk AI system under EU AI Act Annex III. Additional data governance requirements apply from August 2, 2026.
- Anonymize or pseudonymize before training wherever possible — genuinely anonymous data falls outside GDPR scope, reducing obligations for the training activity.
- Use SplitForge Data Masking to mask identifiers locally before any dataset is transferred to an AI vendor or training platform.
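The first two questions in this list can be roughed out in code. The following is a minimal sketch, assuming column names are descriptive enough to hint at their content. The keyword sets and the `flag_columns` and `scan_csv_header` helpers are hypothetical, not part of any library, and a name-based scan can only prompt a manual review, not replace one.

```python
import csv
import io
import re

# Hypothetical keyword sets; extend them to match your own schema.
PERSONAL_TOKENS = {"name", "email", "phone", "address", "ssn", "dob",
                   "customer", "employee", "age", "location"}
PHI_TOKENS = {"patient", "diagnosis", "mrn", "treatment", "medication",
              "comorbidities"}


def _tokens(column):
    """Split a column name such as 'customer_id' into lowercase word tokens."""
    return set(re.split(r"[^a-z0-9]+", column.lower()))


def flag_columns(header):
    """Return (personal, phi): column names whose tokens match a keyword set."""
    personal = [c for c in header if _tokens(c) & PERSONAL_TOKENS]
    phi = [c for c in header if _tokens(c) & PHI_TOKENS]
    return personal, phi


def scan_csv_header(csv_text):
    """Read only the header row of CSV text and flag suspicious columns."""
    header = next(csv.reader(io.StringIO(csv_text)))
    return flag_columns(header)


if __name__ == "__main__":
    sample = "customer_id,age,location,purchase_history,churn_risk_label\n"
    print(scan_csv_header(sample))
```

An empty result does not prove the file is free of personal data. Free-text fields and opaque column names need a manual look, and any column flagged as PHI is also personal data for GDPR purposes.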
TL;DR: The CSV file sitting in your data warehouse that feeds your customer churn model, HR scoring tool, or credit decision system isn't just a data file — it's simultaneously a GDPR processing activity, potentially a HIPAA disclosure, and from August 2, 2026, may be subject to EU AI Act data governance requirements. Most AI projects treat privacy as a deployment problem. It's actually a data collection problem that starts with the training CSV.
The central argument of this post: A single CSV file used for AI training doesn't trigger one regulatory framework. It triggers three simultaneously — and the obligations don't cancel each other out, they stack.
Most AI compliance content covers one framework at a time: GDPR for EU teams, HIPAA for healthcare, EU AI Act for product teams building regulated systems. The actual compliance problem is that the same training dataset routinely triggers all three. An HR scoring model trained on employee data is a GDPR processing activity, may touch HIPAA if the employer is in healthcare, and is a high-risk AI system under EU AI Act Annex III. Each framework has different timelines, different documentation requirements, and different enforcement bodies. None of them care that you were only thinking about one.
Most AI compliance discussions focus on model outputs — transparency, explainability, bias. Far fewer focus on the training data — where most privacy violations actually originate.
The CSV of customer transactions you used to train your churn model was a GDPR processing activity. The 500,000-row CSV of patient records used to fine-tune a clinical decision tool was a HIPAA disclosure. The HR screening dataset used to build a hiring propensity model likely feeds a high-risk AI system under EU AI Act Annex III, which brings data governance obligations for that dataset.
The regulatory frameworks don't distinguish between "the model" and "the data that trained it." They regulate the data processing. And data processing starts with the CSV.
Framework Overlap Matrix — required actions by scenario:
| Scenario | GDPR | HIPAA | EU AI Act |
|---|---|---|---|
| Training data contains EU customer personal data | Lawful basis required (Art 6). DPA with AI platform. DPIA if high-risk. | Not triggered (no PHI) | If Annex III use case: Art 10 data governance documentation required (Aug 2, 2026) |
| Training data contains patient records or clinical data | GDPR applies to EU patients. Lawful basis + DPA required. | BAA with AI platform required before file transfer. De-identify or get BAA. | If clinical decision support AI: likely Annex III high-risk. Art 10 + conformity assessment. |
| HR screening or recruitment AI using employee data | Lawful basis required. Likely Art 22 automated decision rules apply. DPIA required. | Not triggered | Recruitment AI = Annex III high-risk (Art 6(2), Annex III point 4). Art 10 + human oversight + conformity assessment. |
| Credit scoring or financial decision AI | Lawful basis required. Art 22: automated decision rights apply. | Not triggered | Credit/insurance AI = Annex III high-risk (Art 6(2), Annex III point 5). Full high-risk obligations. |
| Training data is genuinely anonymous (Recital 26 standard met) | GDPR does not apply to training step | Not triggered (no PHI) | EU AI Act Art 10 data governance still applies to training data quality — anonymization reduces scope but doesn't eliminate it |
| US-hosted AI platform processing EU data | SCCs + Transfer Impact Assessment required (Chapter V) | BAA required if PHI | Not a separate EU AI Act requirement — but GDPR transfer mechanism is still needed |
Read this table across rows. Each scenario triggers a different combination of requirements. Most AI projects don't map this before building — which is why enforcement is accelerating.
Each framework in this post was assessed against GDPR Articles 5, 22, and 25; HIPAA 45 CFR §§164.502(e) and 164.504(e); and EU AI Act Annex III and Article 10 requirements, March 2026.
Table of Contents
- The Three Frameworks and How They Overlap for AI
- GDPR and AI Training Data
- HIPAA and AI: When a Training Dataset is a PHI Disclosure
- EU AI Act: August 2, 2026 and What It Requires for Training Data
- Practical Workflow: Preparing AI Training Data Compliantly
- Anonymization as the Path Out
- Operator Rules: AI Data Privacy
- Additional Resources
- FAQ
The Three Frameworks and How They Overlap for AI
| Scenario | GDPR | HIPAA | EU AI Act |
|---|---|---|---|
| CSV of EU customer transactions used to train a churn model | Yes | No | Maybe (if model makes automated decisions) |
| CSV of patient records used to fine-tune a clinical AI | Yes (EU patients) | Yes | Yes (healthcare AI = Annex III category) |
| CSV of employee performance data used for HR scoring | Yes | No | Yes (employment decisions = Annex III) |
| CSV of loan applicants used to train a credit model | Yes | No | Yes (financial services decisions = Annex III) |
| Genuinely anonymous aggregated sales data for analytics model | No | No | Depends on model use |
The rule: Framework overlap is the default for AI using personal data in regulated sectors. Treating any one framework as the only obligation is the most common compliance gap.
Obligation Overlap Decision Tree — run this before any dataset enters an AI pipeline:
Does the dataset contain personal data?
│
├── NO (genuinely anonymous) → No GDPR/HIPAA obligations for training step.
│       EU AI Act Art 10 data governance may still apply.
│
└── YES → Does it contain Protected Health Information (PHI)?
    │
    ├── YES → HIPAA applies.
    │       Is the AI vendor's platform confirmed with a signed BAA?
    │       ├── YES → Proceed with GDPR checks below.
    │       └── NO  → Apply Safe Harbor de-identification first.
    │                 (Remove 18 identifiers → PHI becomes non-PHI)
    │
    └── NO → GDPR applies. Continue.
        │
        Will the AI model make automated decisions affecting EU individuals
        in: employment, credit, healthcare, education, or law enforcement?
        │
        ├── YES → EU AI Act Annex III high-risk classification likely.
        │       Required: DPIA (GDPR Art 35) + Art 10 data governance
        │       documentation + DPA with AI platform + SCCs if US-hosted.
        │       Deadline: August 2, 2026.
        │
        └── NO → Standard GDPR training data obligations:
                - Document lawful basis (Art 6)
                - DPIA if processing is high-risk (Art 35)
                - DPA with AI platform (Art 28)
                - SCCs + TIA if US-hosted (Chapter V)
                - Minimize training data to what's necessary (Art 5(1)(c))
ML engineers: save this. Run through it before every new training dataset is prepared.
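For teams that prefer a pre-flight check in code, the tree above can be approximated as a small function. This is a sketch under stated assumptions: the `Dataset` fields and the `required_actions` name are hypothetical, the answers to each question still have to come from a human review, and the sketch applies lawful basis, DPA, and minimization to both GDPR branches, which is slightly broader than the tree as drawn.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    """Answers to the decision-tree questions for one training dataset."""
    has_personal_data: bool
    has_phi: bool = False
    baa_signed: bool = False
    annex_iii_use: bool = False       # employment, credit, healthcare, education, law enforcement
    us_hosted_platform: bool = False


def required_actions(ds):
    """Walk the decision tree and return the list of required actions."""
    if not ds.has_personal_data:
        return [
            "No GDPR/HIPAA obligations for the training step",
            "Check whether EU AI Act Art 10 data governance still applies",
        ]

    actions = []
    if ds.has_phi and not ds.baa_signed:
        actions.append("HIPAA: apply Safe Harbor de-identification (or sign a BAA) before any transfer")

    # Assumption: these three apply on both GDPR branches of the tree.
    actions.append("GDPR: document lawful basis (Art 6)")
    actions.append("GDPR: DPA with AI platform (Art 28)")
    actions.append("GDPR: minimize training data (Art 5(1)(c))")
    if ds.us_hosted_platform:
        actions.append("GDPR: SCCs + TIA for the US-hosted platform (Chapter V)")

    if ds.annex_iii_use:
        actions.append("GDPR: DPIA required (Art 35)")
        actions.append("EU AI Act: Art 10 data governance documentation, deadline August 2, 2026")
    else:
        actions.append("GDPR: DPIA if processing is high-risk (Art 35)")
    return actions


if __name__ == "__main__":
    hr_model = Dataset(has_personal_data=True, annex_iii_use=True, us_hosted_platform=True)
    for action in required_actions(hr_model):
        print("-", action)
```

The output is a checklist, not a compliance determination. Borderline Annex III classifications still need legal review.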
GDPR and AI Training Data
GDPR applies to the processing of personal data regardless of the purpose — including using personal data as AI training data.
Lawful basis for training data: Using customer transaction data to train a churn model is a processing activity. It requires a lawful basis under Article 6. The most common bases used:
- Legitimate interest (Article 6(1)(f)): Requires a balancing test. Using customer data to improve internal models may be justifiable — but customers would reasonably expect this only if it's disclosed in the privacy notice.
- Consent (Article 6(1)(a)): Rarely appropriate for bulk training data — consent must be specific, and revoking consent would require removing that individual's contribution from the model (technically complex).
- Contract (Article 6(1)(b)): Only where training the model is necessary to perform a contract with that individual. Narrow scope.
DPIA requirement: GDPR Article 35 requires a Data Protection Impact Assessment before processing that is likely to result in high risk. Automated decision-making with significant effects on individuals — hiring, credit, healthcare triage — almost always triggers this requirement.
❌ COMMON MISTAKE — using production data for AI training without DPIA:
Data team exports 2M customer records for ML training:
customer_id,age,location,purchase_history,churn_risk_label,support_tickets
Training data goes to: cloud ML platform (US-based, no DPA confirmed)
Problems:
- No documented lawful basis for using customer data in ML training
- No DPIA despite automated churn decisions affecting service offers
- Cloud upload = cross-border transfer without SCCs or TIA
- No privacy notice disclosed AI training use of data
COMPLIANT APPROACH:
1. Document lawful basis (legitimate interest + balancing test)
2. Run DPIA if model makes automated decisions affecting individuals
3. Pseudonymize or anonymize before training where feasible
4. Confirm DPA + SCCs with ML platform before uploading
5. Update privacy notice to disclose AI training use
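Article 22 (automated decision-making): If the AI model makes decisions that significantly affect individuals without meaningful human involvement (credit decisions, hiring screening, personalized pricing), GDPR Article 22 applies. Individuals have the right not to be subject to such a decision unless an exception applies, and where one does, they can obtain human intervention, express their point of view, and contest the decision. Articles 13 to 15 additionally require meaningful information about the logic involved. This applies to the deployed model, but compliance starts with how the training data was handled.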
What this means for your AI data workflow: Every CSV that becomes training data is a processing activity. The lawful basis question must be answered before the CSV reaches the training pipeline, not after the model is deployed.
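Step 3 of the compliant approach above, pseudonymizing before training, can be sketched with a keyed hash. This is an illustration only: `pseudonymize_rows` and its parameters are hypothetical names, the key must be stored separately from the dataset, and the output is still personal data under GDPR because whoever holds the key can re-link the tokens.

```python
import csv
import hashlib
import hmac
import io


def pseudonymize_rows(rows, id_field, secret_key, drop_fields=()):
    """Yield copies of each row with id_field replaced by a keyed token.

    secret_key is bytes. The same key and ID always produce the same token,
    so joins across files still work without exposing the original ID.
    """
    for row in rows:
        out = {k: v for k, v in row.items() if k not in drop_fields}
        digest = hmac.new(secret_key, row[id_field].encode("utf-8"), hashlib.sha256)
        out[id_field] = digest.hexdigest()[:16]  # truncated for readability
        yield out


if __name__ == "__main__":
    raw = "customer_id,email,age,churn_risk_label\nC-1001,a@example.com,34,1\n"
    reader = csv.DictReader(io.StringIO(raw))
    # Assumption: in real use the key comes from a secrets manager, not source code.
    for row in pseudonymize_rows(reader, "customer_id", b"example-key", drop_fields=("email",)):
        print(row)
```

A plain unkeyed hash would be weaker here, since anyone who can guess the ID format could recompute the tokens.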
HIPAA and AI: When a Training Dataset is a PHI Disclosure
Under HIPAA (45 CFR §§164.502(e) and 164.504(e)), any vendor whose servers receive Protected Health Information is generally a Business Associate — even if the vendor is an AI platform, even if the purpose is model training, and even if the vendor claims not to retain the data.
The HIPAA AI risk scenario:
A hospital exports patient outcome data in CSV format to fine-tune a clinical decision support model. The AI vendor's platform receives the CSV on its servers. The AI vendor is now a Business Associate. A signed BAA is required before this transfer. Most general-purpose ML platforms and LLM providers do not offer BAAs by default.
❌ PHI in AI training pipeline without BAA:
Hospital exports: patient_id, diagnosis_code, treatment_outcome,
age, gender, comorbidities, readmission_flag
Exports to: Commercial ML platform (AWS SageMaker, Google Vertex AI,
or a commercial LLM fine-tuning service)
Commercial ML platforms have BAA processes — but only if:
- You've requested and signed a BAA specifically
- The BAA covers the AI training use case
- The data processing is within the BAA's scope
Uploading without a confirmed BAA = potential HIPAA violation.
The platform's general terms of service are not a BAA.
COMPLIANT:
1. Remove direct identifiers before training (Safe Harbor de-identification)
2. If identifiers needed: de-identify via Expert Determination, or sign a BAA with the platform
3. Confirm BAA covers the specific ML training use case in writing
HIPAA Safe Harbor de-identification for training data: HIPAA provides two methods for de-identifying data so it falls outside PHI scope:
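- Safe Harbor: Remove 18 specified identifiers (names, all date elements except the year, geographic units smaller than a state, SSNs, etc.), with no actual knowledge that the remaining data could identify an individual. The resulting data is no longer PHI.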
- Expert Determination: A qualified statistical expert certifies that the risk of identifying individuals is very small.
For AI training data, Safe Harbor de-identification is often feasible and eliminates the BAA requirement entirely. A CSV with diagnosis codes, age bands (not exact ages), and state-level geography — with all 18 identifiers removed — is not PHI.
What this means for your AI data workflow: Before any PHI reaches an AI vendor's platform, confirm a BAA is in place. If a BAA isn't available or the use case is unclear, de-identify to Safe Harbor standard first.
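A Safe Harbor pass over a tabular export might look like the sketch below. It is deliberately partial: the column names are hypothetical, it handles only a few of the 18 identifier categories, it assumes ISO dates (YYYY-MM-DD), and it does nothing about identifiers buried in free-text fields. Treat it as a starting point for review, not as proof of de-identification.

```python
# Hypothetical column names; map them to your own export schema.
DROP_COLUMNS = {"patient_id", "name", "ssn", "email", "phone", "mrn",
                "street_address", "zip", "birth_date"}
DATE_COLUMNS = {"admission_date", "discharge_date"}


def deidentify_row(row):
    """Return a copy of row with direct identifiers removed and values generalized."""
    out = {}
    for col, value in row.items():
        if col in DROP_COLUMNS:
            continue  # direct identifier: remove entirely
        if col in DATE_COLUMNS:
            # Keep the year only (assumes an ISO date such as 2024-03-15).
            out[col.replace("_date", "_year")] = value[:4]
        elif col == "age":
            # Ages of 90 and above are aggregated into a single category.
            out[col] = "90+" if int(value) >= 90 else value
        else:
            out[col] = value
    return out


if __name__ == "__main__":
    record = {"patient_id": "P-1", "name": "Jane Doe", "age": "93",
              "admission_date": "2024-03-15", "diagnosis_code": "I10", "state": "CA"}
    print(deidentify_row(record))
```

Dropping ZIP codes and birth dates outright is more conservative than the rule strictly requires, which keeps the sketch simple at some cost in signal.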
EU AI Act: August 2, 2026 and What It Requires for Training Data
The EU AI Act enters full enforcement for Annex III high-risk AI systems on August 2, 2026. Prohibited practices have been enforceable since February 2025. General-purpose AI model obligations have applied since August 2025.
What is a high-risk AI system under Annex III?
AI systems in these categories are high-risk by default:
- Biometric identification and categorization
- Critical infrastructure management
- Education and vocational training — determining access, admission, grading
- Employment and workforce management — recruitment, selection, promotion, performance evaluation, termination
- Essential private and public services — credit scoring, social benefits
- Law enforcement — risk assessment, profiling
- Migration and border control
- Administration of justice
If your AI system makes or influences decisions in any of these categories affecting EU individuals, it is likely high-risk. Classification depends on system design and intended use context — borderline cases require legal review, as interpretation of Annex III categories continues to evolve. The penalties for confirmed non-compliance: up to €35 million or 7% of global annual turnover for the most serious violations; up to €15 million or 3% for non-compliance with high-risk obligations.
Article 10 — Data governance requirements for training data:
High-risk AI systems must comply with data governance requirements for training, validation, and testing datasets. Under Article 10, training data must be:
- Relevant and representative for the intended purpose
- Free of errors and complete to the best extent possible
- Processed with appropriate data governance practices documented
- Subject to examination for possible biases
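The representativeness and bias-examination points above can be given a first-pass check in code. Article 10 does not prescribe a metric, so this is just one simple option: compare each group's share of the training data against a reference share. The function name, the `tolerance` default, and the reference shares are all assumptions for illustration.

```python
from collections import Counter


def representation_gaps(rows, field, reference_shares, tolerance=0.05):
    """Return {group: (observed_share, expected_share)} for groups whose share
    in the data differs from the reference by more than tolerance."""
    counts = Counter(row[field] for row in rows)
    total = sum(counts.values())
    gaps = {}
    for group, expected in reference_shares.items():
        observed = counts.get(group, 0) / total if total else 0.0
        if abs(observed - expected) > tolerance:
            gaps[group] = (observed, expected)
    return gaps


if __name__ == "__main__":
    data = [{"region": "north"}] * 8 + [{"region": "south"}] * 2
    # Assumption: the reference shares come from your actual target population.
    print(representation_gaps(data, "region", {"north": 0.5, "south": 0.5}))
```

A clean result on one field says nothing about intersections of fields or about label bias, so it belongs in the governance record as one check among several.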
The GDPR + EU AI Act overlap:
EU AI Act Recital 69 explicitly references GDPR data minimization for AI training data. GDPR Article 5(1)(c) data minimization and Article 25 privacy by design apply alongside EU AI Act Article 10. Both require that training data be limited to what is necessary — using a 10-million-row customer dataset when 100,000 rows would suffice is a minimization problem under both frameworks.
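Minimization along both axes, fewer columns and fewer rows, can be sketched in a single pass. The example below uses reservoir sampling so the full file never has to be loaded into memory. `minimize` and its parameters are hypothetical names, and a uniform random sample does not guarantee that small groups stay represented, so a stratified sample may be the better choice for some datasets.

```python
import random


def minimize(rows, keep_fields, sample_size, seed=0):
    """Keep only keep_fields and return a uniform random sample of rows.

    Uses reservoir sampling, so rows can be any iterable, including a
    csv.DictReader streaming over a very large file.
    """
    rng = random.Random(seed)
    sample = []
    for i, row in enumerate(rows):
        slim = {k: row[k] for k in keep_fields}  # drop every other column
        if i < sample_size:
            sample.append(slim)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < sample_size:
                sample[j] = slim
    return sample


if __name__ == "__main__":
    rows = ({"customer_id": str(n), "email": f"user{n}@example.com", "tenure": str(n % 60)}
            for n in range(10_000))
    subset = minimize(rows, keep_fields=("tenure",), sample_size=100)
    print(len(subset), subset[0])
```

Recording the seed and the field list alongside the sample makes the minimization step reproducible for later documentation.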
EU AI Act Article 10 + GDPR Article 5 minimization checklist
for AI training datasets:
Before using a CSV as training data:
□ Dataset is relevant to the model's intended purpose (Art 10)
□ Dataset is representative — no systematic bias in included populations (Art 10)
□ Personal data limited to what's necessary for the training purpose (GDPR Art 5)
□ Sensitive categories (health, ethnicity, etc.) handled with explicit basis (GDPR Art 9)
□ DPIA completed if model makes significant automated decisions (GDPR Art 35)
□ Technical documentation records data sources and governance (EU AI Act Art 11)
□ Data lineage documented: what data, from where, processed how (EU AI Act Art 10(2), Art 11)
What this means for your AI data workflow: If your organization deploys AI in hiring, credit, healthcare, or education affecting EU individuals, August 2, 2026 is your compliance deadline for Annex III requirements. The training data documentation requirements need to start now — not when enforcement begins.
AI Data Lifecycle: Where Regulation Applies at Each Stage
Compliance starts before training and continues through deployment. This is the full map — most teams only address the training step.
AI DATA LIFECYCLE
STAGE 1: Collection
→ CSV export of customer / employee / patient data
→ GDPR: lawful basis required for collection purpose
→ HIPAA: PHI triggers BAA if cloud-stored
→ EU AI Act: data quality obligations begin here (Art 10)
STAGE 2: Preprocessing
→ Field selection, cleaning, normalization, masking
→ GDPR Art 5(1)(c): minimize to fields model actually uses
→ GDPR Art 25: privacy-by-design at preprocessing stage
→ Action: mask locally before any upload to training platform
STAGE 3: Training
→ CSV data reaches ML platform
→ GDPR: platform is a processor (DPA required)
→ HIPAA: platform may be a Business Associate (BAA required if PHI)
→ EU AI Act: training data governance and technical documentation required (Art 10, 11)
→ Cross-border: SCCs + TIA if US-hosted platform
STAGE 4: Model Deployment
→ Model makes predictions or decisions
→ GDPR Art 22: automated decisions affecting individuals require basis + opt-out
→ EU AI Act: high-risk systems require human oversight, conformity assessment
→ Transparency: Art 50 disclosure obligations (active August 2026)
STAGE 5: Monitoring
→ Ongoing model performance and bias tracking
→ EU AI Act: post-market monitoring required for high-risk systems
→ GDPR: purpose limitation — model must not drift beyond original use
→ Action: document bias checks and governance reviews
Teams that handle Stage 2 (preprocessing) well often still have gaps at Stages 3 and 4. If Annex III applies, close those gaps before August 2026.
Practical Workflow: Preparing AI Training Data Compliantly
Step 1: Classify the AI system. Is it making automated decisions in an Annex III category affecting EU individuals? If yes, high-risk obligations apply from August 2, 2026. Document this classification.
Step 2: Identify the training data and its personal data content. List every CSV file in the training pipeline. Identify which contain personal data, which contain PHI, and which contain special category data under GDPR Article 9.
Step 3: Confirm lawful basis for each dataset. For each CSV with personal data, document the lawful basis. Run a DPIA if the model makes significant automated decisions. For PHI, confirm a BAA with the AI vendor before any transfer.
Step 4: Minimize and anonymize before training. Remove or mask fields that aren't necessary for the training objective. Apply Safe Harbor de-identification to PHI where feasible. Pseudonymize customer identifiers. A training dataset should contain the minimum personal data needed to achieve the model's objective, not the maximum available.
Step 5: Confirm DPA and transfer mechanism with the AI platform. The ML training platform is a data processor if it processes personal data on your behalf, so a DPA is required. For EU data on US platforms, add SCCs and a Transfer Impact Assessment.
Step 6: Document data lineage. EU AI Act Articles 10 and 11 require documented training data sources and governance for high-risk systems. Maintain a record of which datasets were used, what personal data they contained, what governance was applied, and what bias examination was conducted.
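The record described in Step 6 can be kept as a small structured file next to each dataset. The sketch below is one possible shape. Neither the EU AI Act nor GDPR prescribes this schema, so every field name here is an assumption, as are the `DatasetRecord` and `fingerprint` names. A SHA-256 fingerprint ties the record to the exact file that was used.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from datetime import date


@dataclass
class DatasetRecord:
    """One lineage entry per training dataset (hypothetical schema)."""
    name: str
    source: str
    sha256: str
    personal_data_fields: list
    lawful_basis: str
    governance_steps: list = field(default_factory=list)
    bias_checks: list = field(default_factory=list)
    recorded_on: str = field(default_factory=lambda: date.today().isoformat())


def fingerprint(data):
    """Return the SHA-256 hex digest of the dataset bytes."""
    return hashlib.sha256(data).hexdigest()


if __name__ == "__main__":
    csv_bytes = b"customer_token,tenure,churn_risk_label\nab12,14,1\n"
    record = DatasetRecord(
        name="churn_training_v1",
        source="warehouse export (illustrative)",
        sha256=fingerprint(csv_bytes),
        personal_data_fields=["customer_token"],
        lawful_basis="legitimate interest, balancing test on file",
        governance_steps=["dropped email", "pseudonymized customer_id"],
        bias_checks=["region share vs. customer base"],
    )
    print(json.dumps(asdict(record), indent=2))
```

Storing these records in version control alongside the training code gives a dated history of what changed between model versions.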
Many ML platforms and LLM APIs upload your training CSV to remote servers — often in the US — for processing. For files containing personal data, this is simultaneously a GDPR cross-border transfer, a potential HIPAA BAA event, and (for Annex III systems) an EU AI Act data governance event. SplitForge processes masking and anonymization locally before any file leaves the machine. The training platform only receives the pre-processed, minimized dataset.
For a complete overview of privacy frameworks, see our privacy-first data processing guide. For anonymization techniques to reduce training data regulatory scope, see our GDPR anonymization guide.
Anonymization as the Path Out
Genuinely anonymous training data — meeting the GDPR Recital 26 standard — falls outside GDPR scope entirely. If a training dataset can be made genuinely anonymous before use, the GDPR obligations for the training activity are materially reduced.
For tabular AI training data, this typically means:
- Aggregating individual records into group statistics rather than training on individual rows
- Applying k-anonymity across quasi-identifiers (age, location, job title combinations)
- Removing all direct identifiers and limiting quasi-identifier precision
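The k-anonymity point above can be checked mechanically: group rows by their quasi-identifier values and look at the smallest group. The helper names below are hypothetical, and passing this check does not by itself establish anonymity under Recital 26, since k-anonymity ignores attacks based on the sensitive values inside each group.

```python
from collections import Counter


def _group_sizes(rows, quasi_identifiers):
    """Count rows per unique combination of quasi-identifier values."""
    return Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)


def k_anonymity(rows, quasi_identifiers):
    """Return the size of the smallest group, i.e. the dataset's k (0 if empty)."""
    groups = _group_sizes(rows, quasi_identifiers)
    return min(groups.values()) if groups else 0


def violating_groups(rows, quasi_identifiers, k):
    """Return {combination: count} for groups with fewer than k rows."""
    groups = _group_sizes(rows, quasi_identifiers)
    return {combo: n for combo, n in groups.items() if n < k}


if __name__ == "__main__":
    data = [
        {"age_band": "30-39", "state": "CA", "churn": "1"},
        {"age_band": "30-39", "state": "CA", "churn": "0"},
        {"age_band": "30-39", "state": "CA", "churn": "0"},
        {"age_band": "40-49", "state": "NY", "churn": "1"},
    ]
    print(k_anonymity(data, ["age_band", "state"]))
    print(violating_groups(data, ["age_band", "state"], k=2))
```

Groups that fall below the chosen k can be suppressed or generalized further, for example by widening age bands, before the dataset is re-checked.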
The practical limit: Many ML models require individual-level training data to learn meaningful patterns. A churn prediction model trained on aggregated data loses most of its signal. In these cases, pseudonymization (not anonymization) is the realistic option — reducing breach risk while GDPR obligations remain.
Synthetic data as an alternative: Generating synthetic data that preserves the statistical properties of real customer data — without containing real individuals' records — is a growing approach. Properly generated synthetic data may fall outside GDPR scope if it genuinely cannot be linked to real individuals. See our test data generation guide for the practical implementation.
Operator Rules: AI Data Privacy
Short. Non-negotiable. Reference before any personal data enters an AI training pipeline.
- Training data is a processing activity — GDPR applies the moment you start
- PHI sent to any AI vendor without a BAA is a potential HIPAA violation — regardless of the vendor's privacy claims
- EU AI Act high-risk systems need documented data governance — build the records now, not at enforcement time
- Minimize training data to what's necessary — 100K representative rows is almost always better than 10M noisy rows with full PII
- DPIA first if the model makes automated decisions affecting individuals — this is not optional
- The AI platform's terms of service are not a DPA — request one before uploading
- August 2, 2026 is the Annex III deadline — if you're deploying high-risk AI in the EU, compliance needs to start now
Additional Resources
EU AI Act:
- EU AI Act — Annex III: High-Risk AI Systems — Full list of high-risk categories
- EU AI Act — Article 10: Data Governance — Training data requirements for high-risk systems
- EU AI Act Implementation Timeline — Official deadline reference
GDPR:
- GDPR Article 22 — Automated Decision-Making — Individual rights against automated decisions
- GDPR Article 35 — DPIA — When DPIAs are required
HIPAA:
- HHS: De-identification of PHI — Safe Harbor and Expert Determination methods
- 45 CFR §164.502(e) — Business Associate disclosure requirements
Disclaimer: This post is for informational purposes only and does not constitute legal advice. AI system classification, training data obligations, and regulatory applicability depend on your specific systems, data types, and jurisdiction. Consult qualified legal counsel before drawing compliance conclusions.