
AI Privacy in Data Processing: GDPR, HIPAA, and EU AI Act for Data Teams

March 18, 2026
By SplitForge Team

Quick Answer

AI systems that use personal data — for training, fine-tuning, scoring, or making decisions — are regulated by GDPR, potentially HIPAA, and the EU AI Act from August 2, 2026. GDPR requires a lawful basis for using personal data in AI training and a DPIA for high-risk AI. HIPAA requires a Business Associate Agreement before any PHI reaches an AI vendor's servers. The EU AI Act imposes data governance requirements on training datasets for high-risk systems, with full enforcement of Annex III requirements starting August 2, 2026. A CSV file used as AI training data carries all three sets of obligations simultaneously.

If you ignore this:

  • Training on EU customer data without a documented lawful basis → GDPR violation, enforceable now
  • Uploading PHI to an ML platform without a signed BAA → HIPAA violation, enforceable now
  • Deploying a hiring, credit, or healthcare AI without Annex III data governance documentation → EU AI Act high-risk breach, enforceable August 2, 2026

These are not hypothetical risks. They are the exact scenarios regulators are investigating in 2026.


Fast Fix (2 Minutes)

If you're preparing a CSV dataset for AI model training or fine-tuning right now:

  1. Does the CSV contain personal data? If yes, GDPR applies. Lawful basis required. DPIA may be required if the AI makes automated decisions about individuals.
  2. Does the CSV contain PHI? If yes, HIPAA applies. The AI vendor's servers receiving PHI = Business Associate. BAA required before any file transfer.
  3. Will the AI system make decisions about employment, credit, healthcare, or education? If yes, it may be a high-risk AI system under EU AI Act Annex III. Additional data governance requirements apply from August 2, 2026.
  4. Anonymize or pseudonymize before training wherever possible — genuinely anonymous data falls outside GDPR scope, reducing obligations for the training activity.
  5. Use SplitForge Data Masking to mask identifiers locally before any dataset is transferred to an AI vendor or training platform.

TL;DR: The CSV file sitting in your data warehouse that feeds your customer churn model, HR scoring tool, or credit decision system isn't just a data file — it's simultaneously a GDPR processing activity, potentially a HIPAA disclosure, and from August 2, 2026, may be subject to EU AI Act data governance requirements. Most AI projects treat privacy as a deployment problem. It's actually a data collection problem that starts with the training CSV.


The central argument of this post: A single CSV file used for AI training doesn't trigger one regulatory framework. It triggers three simultaneously — and the obligations don't cancel each other out, they stack.

Most AI compliance content covers one framework at a time: GDPR for EU teams, HIPAA for healthcare, EU AI Act for product teams building regulated systems. The actual compliance problem is that the same training dataset routinely triggers all three. An HR scoring model trained on employee data is a GDPR processing activity, may touch HIPAA if the employer is in healthcare, and is a high-risk AI system under EU AI Act Annex III. Each framework has different timelines, different documentation requirements, and different enforcement bodies. None of them care that you were only thinking about one.

Most AI compliance discussions focus on model outputs — transparency, explainability, bias. Far fewer focus on the training data — where most privacy violations actually originate.

Using a CSV of customer transactions to train your churn model was a GDPR processing activity. Sending a 500,000-row CSV of patient records to a vendor to fine-tune a clinical decision tool was a HIPAA disclosure. The HR screening dataset used to build a hiring propensity model may feed a high-risk AI system under EU AI Act Annex III.

The regulatory frameworks don't distinguish between "the model" and "the data that trained it." They regulate the data processing. And data processing starts with the CSV.

Framework Overlap Matrix — required actions by scenario:

Scenario | GDPR | HIPAA | EU AI Act
Training data contains EU customer personal data | Lawful basis required (Art 6). DPA with AI platform. DPIA if high-risk. | Not triggered (no PHI) | If Annex III use case: Art 10 data governance documentation required (Aug 2, 2026)
Training data contains patient records or clinical data | GDPR applies to EU patients. Lawful basis + DPA required. | BAA with AI platform required before file transfer. De-identify or get BAA. | If clinical decision support AI: likely Annex III high-risk. Art 10 + conformity assessment.
HR screening or recruitment AI using employee data | Lawful basis required. Likely Art 22 automated decision rules apply. DPIA required. | Not triggered | Recruitment AI = Annex III high-risk (Art 6(2), Annex III point 4). Art 10 + human oversight + conformity assessment.
Credit scoring or financial decision AI | Lawful basis required. Art 22: automated decision rights apply. | Not triggered | Credit/insurance AI = Annex III high-risk (Art 6(2), Annex III point 5). Full high-risk obligations.
Training data is genuinely anonymous (Recital 26 standard met) | GDPR does not apply to training step | Not triggered (no PHI) | EU AI Act Art 10 data governance still applies to training data quality; anonymization reduces scope but doesn't eliminate it
US-hosted AI platform processing EU data | SCCs + Transfer Impact Assessment required (Chapter V) | BAA required if PHI | Not a separate EU AI Act requirement, but a GDPR transfer mechanism is still needed

Read this table across rows. Each scenario triggers a different combination of requirements. Most AI projects don't map this before building — which is why enforcement is accelerating.

Each framework in this post was assessed against GDPR Articles 5, 22, and 25; HIPAA 45 CFR §§164.502(e) and 164.504(e); and EU AI Act Annex III and Article 10 requirements, March 2026.



The Three Frameworks and How They Overlap for AI

Scenario | GDPR | HIPAA | EU AI Act
CSV of EU customer transactions used to train a churn model | Yes | No | Maybe (if model makes automated decisions)
CSV of patient records used to fine-tune a clinical AI | Yes (EU patients) | Yes | Yes (healthcare AI = Annex III category)
CSV of employee performance data used for HR scoring | Yes | No | Yes (employment decisions = Annex III)
CSV of loan applicants used to train a credit model | Yes | No | Yes (financial services decisions = Annex III)
Genuinely anonymous aggregated sales data for analytics model | No | No | Depends on model use

The rule: Framework overlap is the default for AI using personal data in regulated sectors. Treating any one framework as the only obligation is the most common compliance gap.

Obligation Overlap Decision Tree — run this before any dataset enters an AI pipeline:

Does the dataset contain personal data?
│
├── NO (genuinely anonymous) → No GDPR/HIPAA obligations for training step.
│                              EU AI Act Art 10 data governance may still apply.
│
└── YES → Does it contain Protected Health Information (PHI)?
           │
           ├── YES → HIPAA applies.
           │         Is the AI vendor's platform confirmed with a signed BAA?
           │         ├── YES → Proceed with GDPR checks below.
           │         └── NO → Apply Safe Harbor de-identification first.
           │                   (Remove 18 identifiers → PHI becomes non-PHI)
           │
           └── NO → GDPR applies. Continue.
                      │
                      Will the AI model make automated decisions affecting EU individuals
                      in: employment, credit, healthcare, education, or law enforcement?
                      │
                      ├── YES → EU AI Act Annex III high-risk classification likely.
                      │         Required: DPIA (GDPR Art 35) + Art 10 data governance
                      │         documentation + DPA with AI platform + SCCs if US-hosted.
                      │         Deadline: August 2, 2026.
                      │
                      └── NO → Standard GDPR training data obligations:
                                - Document lawful basis (Art 6)
                                - DPIA if processing is high-risk (Art 35)
                                - DPA with AI platform (Art 28)
                                - SCCs + TIA if US-hosted (Chapter V)
                                - Minimize training data to what's necessary (Art 5(1)(c))

ML engineers: save this. Run through it before every new training dataset is prepared.
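
The tree above can be sketched as a small function that a pipeline runs before a dataset is staged for training. This is a minimal illustration, not a legal classifier: the `Dataset` fields, the domain labels, and the action strings are hypothetical names chosen for this example, and the choice to continue into the GDPR checks after the "no BAA" branch is one reading of the tree.

```python
from dataclasses import dataclass
from typing import List, Optional

# Decision domains the tree treats as likely Annex III high-risk (labels are illustrative).
ANNEX_III_DOMAINS = {"employment", "credit", "healthcare", "education", "law_enforcement"}


@dataclass
class Dataset:
    """Hypothetical description of a dataset about to enter an AI pipeline."""
    has_personal_data: bool
    has_phi: bool = False
    baa_signed: bool = False
    decision_domain: Optional[str] = None  # e.g. "employment", or None
    us_hosted_platform: bool = False


def required_actions(ds: Dataset) -> List[str]:
    """Walk the decision tree and return the actions it lists for this dataset."""
    if not ds.has_personal_data:
        return [
            "No GDPR/HIPAA obligations for training step",
            "Check whether EU AI Act Art 10 data governance still applies",
        ]

    actions: List[str] = []
    if ds.has_phi:
        actions.append("HIPAA applies")
        if not ds.baa_signed:
            actions.append("Apply Safe Harbor de-identification before transfer")

    # Assumption: de-identified health data may still be personal data under GDPR,
    # so the GDPR branch runs in every case where personal data is present.
    actions.append("GDPR applies")
    if ds.decision_domain in ANNEX_III_DOMAINS:
        actions += [
            "EU AI Act Annex III high-risk classification likely",
            "DPIA (GDPR Art 35)",
            "Art 10 data governance documentation",
            "DPA with AI platform (Art 28)",
        ]
    else:
        actions += [
            "Document lawful basis (Art 6)",
            "DPIA if processing is high-risk (Art 35)",
            "DPA with AI platform (Art 28)",
            "Minimize training data (Art 5(1)(c))",
        ]
    if ds.us_hosted_platform:
        actions.append("SCCs + TIA (Chapter V)")
    return actions


hr_dataset = Dataset(has_personal_data=True, decision_domain="employment",
                     us_hosted_platform=True)
for action in required_actions(hr_dataset):
    print("-", action)
```

The output is a checklist to hand to whoever owns compliance review. It does not replace the review itself, and borderline classifications still need legal input.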


GDPR and AI Training Data

GDPR applies to the processing of personal data regardless of the purpose — including using personal data as AI training data.

Lawful basis for training data: Using customer transaction data to train a churn model is a processing activity. It requires a lawful basis under Article 6. The most commonly used bases are:

  • Legitimate interest (Article 6(1)(f)): Requires a balancing test. Using customer data to improve internal models may be justifiable — but customers would reasonably expect this only if it's disclosed in the privacy notice.
  • Consent (Article 6(1)(a)): Rarely appropriate for bulk training data — consent must be specific, and revoking consent would require removing that individual's contribution from the model (technically complex).
  • Contract (Article 6(1)(b)): Only where training the model is necessary to perform a contract with that individual. Narrow scope.

DPIA requirement: GDPR Article 35 requires a Data Protection Impact Assessment before processing that is likely to result in high risk. Automated decision-making with significant effects on individuals — hiring, credit, healthcare triage — almost always triggers this requirement.

❌ COMMON MISTAKE — using production data for AI training without DPIA:
Data team exports 2M customer records for ML training:
customer_id,age,location,purchase_history,churn_risk_label,support_tickets

Training data goes to: cloud ML platform (US-based, no DPA confirmed)

Problems:
- No documented lawful basis for using customer data in ML training
- No DPIA despite automated churn decisions affecting service offers
- Cloud upload = cross-border transfer without SCCs or TIA
- Privacy notice does not disclose use of the data for AI training

COMPLIANT APPROACH:
1. Document lawful basis (legitimate interest + balancing test)
2. Run DPIA if model makes automated decisions affecting individuals
3. Pseudonymize or anonymize before training where feasible
4. Confirm DPA + SCCs with ML platform before uploading
5. Update privacy notice to disclose AI training use

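Article 22, automated decision-making: If the AI model makes or substantially influences decisions that significantly affect individuals (credit decisions, hiring screening, personalized pricing), GDPR Article 22 may apply. Individuals have the right not to be subject to a decision based solely on automated processing that has legal or similarly significant effects. Where such decisions are permitted, they can obtain human intervention, express their point of view, and contest the decision, and Articles 13 to 15 entitle them to meaningful information about the logic involved. This applies to the deployed model, but its compliance starts with how the training data was handled.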

What this means for your AI data workflow: Every CSV that becomes training data is a processing activity. The lawful basis question must be answered before the CSV reaches the training pipeline, not after the model is deployed.
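
One way to pseudonymize a customer identifier before the CSV leaves your environment is a keyed hash, sketched below with Python's standard `hmac` module. The column names come from the example export above; the sample rows, the key, and the 16-character truncation are assumptions made for illustration. Pseudonymized data is still personal data under GDPR, so this reduces exposure without removing the lawful basis or DPA requirements.

```python
import csv
import hashlib
import hmac
import io


def pseudonymize_column(rows, column, secret_key):
    """Yield copies of each row with `column` replaced by a keyed HMAC-SHA256 digest."""
    for row in rows:
        out = dict(row)
        digest = hmac.new(secret_key, row[column].encode("utf-8"), hashlib.sha256)
        out[column] = digest.hexdigest()[:16]  # truncated for readability (assumption)
        yield out


# Illustrative two-row export; real data would be read from a file.
sample = io.StringIO("customer_id,age,churn_risk_label\nC-1001,34,0\nC-1002,51,1\n")

# Hypothetical key: in practice it must be stored separately from the dataset,
# because anyone holding it can re-link pseudonyms to customers.
key = b"replace-with-a-secret-kept-outside-the-dataset"

masked = list(pseudonymize_column(csv.DictReader(sample), "customer_id", key))
for row in masked:
    print(row)
```

A keyed hash is used rather than a plain SHA-256 because customer IDs usually have a small, guessable format, and an unkeyed hash of a guessable value can be reversed by hashing every candidate.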


HIPAA and AI: When a Training Dataset is a PHI Disclosure

Under HIPAA (45 CFR §§164.502(e) and 164.504(e)), any vendor whose servers receive Protected Health Information is generally a Business Associate — even if the vendor is an AI platform, even if the purpose is model training, and even if the vendor claims not to retain the data.

The HIPAA AI risk scenario:

A hospital exports patient outcome data in CSV format to fine-tune a clinical decision support model. The AI vendor's platform receives the CSV on its servers. The AI vendor is now a Business Associate. A signed BAA is required before this transfer. Most general-purpose ML platforms and LLM providers do not offer BAAs by default.

❌ PHI in AI training pipeline without BAA:
Hospital exports: patient_id, diagnosis_code, treatment_outcome,
                  age, gender, comorbidities, readmission_flag

Exports to: Commercial ML platform (AWS SageMaker, Google Vertex AI,
            or a commercial LLM fine-tuning service)

Commercial ML platforms have BAA processes — but only if:
- You've requested and signed a BAA specifically
- The BAA covers the AI training use case
- The data processing is within the BAA's scope

Uploading without a confirmed BAA = potential HIPAA violation.
The platform's general terms of service are not a BAA.

COMPLIANT:
1. Remove direct identifiers before training (Safe Harbor de-identification)
2. If the model needs fields Safe Harbor would remove: Expert Determination, or keep the data as PHI under a signed BAA with the platform
3. Confirm BAA covers the specific ML training use case in writing

HIPAA Safe Harbor de-identification for training data: HIPAA provides two methods for de-identifying data so it falls outside PHI scope:

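  • Safe Harbor: Remove 18 specified identifiers (names, all date elements except year, geographic units smaller than a state, SSN, etc.), with no actual knowledge that the remaining information could identify an individual. The resulting data is no longer PHI.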
  • Expert Determination: A qualified statistical expert certifies that the risk of identifying individuals is very small.

For AI training data, Safe Harbor de-identification is often feasible and eliminates the BAA requirement entirely. A CSV with diagnosis codes, age bands (not exact ages), and state-level geography — with all 18 identifiers removed — is not PHI.

What this means for your AI data workflow: Before any PHI reaches an AI vendor's platform, confirm a BAA is in place. If a BAA isn't available or the use case is unclear, de-identify to Safe Harbor standard first.
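
A rough sketch of what a Safe Harbor style pass over one record might look like is shown below. It is deliberately incomplete: `DROP_COLUMNS` is an illustrative subset rather than the full list of 18 identifiers, the column names `admission_date`, `zip`, and `state` are hypothetical additions to the example export, and dates are assumed to be ISO formatted (`YYYY-MM-DD`). Free-text fields, which often hide identifiers, are not handled at all.

```python
# Illustrative subset of identifier columns to drop; NOT the full Safe Harbor list.
DROP_COLUMNS = {"patient_id", "name", "ssn", "email", "phone", "zip"}


def generalize_row(row):
    """Drop identifier columns and coarsen age and date fields for one record."""
    out = {k: v for k, v in row.items() if k not in DROP_COLUMNS}

    # Safe Harbor treats ages over 89 as identifying; group them into one bucket.
    if "age" in out:
        out["age"] = "90+" if int(out["age"]) >= 90 else str(int(out["age"]))

    # Keep only the year of a date (assumes an ISO YYYY-MM-DD string).
    if "admission_date" in out:
        out["admission_year"] = out.pop("admission_date")[:4]

    return out


record = {
    "patient_id": "P-0001",
    "diagnosis_code": "E11.9",
    "age": "93",
    "admission_date": "2024-03-18",
    "state": "OH",
    "zip": "43004",
}
print(generalize_row(record))
```

Treat the result as a starting point for review. Whether a dataset actually meets the Safe Harbor standard depends on every remaining column, not only the ones this sketch touches.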


EU AI Act: August 2, 2026 and What It Requires for Training Data

The EU AI Act enters full enforcement for Annex III high-risk AI systems on August 2, 2026. Prohibited practices have been enforceable since February 2025. General-purpose AI model obligations have applied since August 2025.

What is a high-risk AI system under Annex III?

AI systems in these categories are high-risk by default:

  • Biometric identification and categorization
  • Critical infrastructure management
  • Education and vocational training — determining access, admission, grading
  • Employment and workforce management — recruitment, selection, promotion, performance evaluation, termination
  • Essential private and public services — credit scoring, social benefits
  • Law enforcement — risk assessment, profiling
  • Migration and border control
  • Administration of justice

If your AI system makes or influences decisions in any of these categories affecting EU individuals, it is likely high-risk. Classification depends on system design and intended use context — borderline cases require legal review, as interpretation of Annex III categories continues to evolve. The penalties for confirmed non-compliance: up to €35 million or 7% of global annual turnover for the most serious violations; up to €15 million or 3% for non-compliance with high-risk obligations.

Article 10 — Data governance requirements for training data:

High-risk AI systems must comply with data governance requirements for training, validation, and testing datasets. Under Article 10, training data must be:

  • Relevant and representative for the intended purpose
  • Free of errors and complete to the best extent possible
  • Processed with appropriate data governance practices documented
  • Subject to examination for possible biases
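
As a first pass at the "representative" and "examined for bias" points, a team could compute how each group is represented in the training CSV and how often each group carries the positive label. The sketch below assumes string-valued columns as read by `csv.DictReader`; the column names `gender` and `readmission_flag` are borrowed from the hospital example in this post, and the four rows are made up for illustration. These two numbers are a screening aid, not a complete bias examination.

```python
from collections import Counter


def group_shares(rows, group_column):
    """Return each group's share of the dataset (values sum to 1.0)."""
    counts = Counter(row[group_column] for row in rows)
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}


def positive_rate_by_group(rows, group_column, label_column, positive="1"):
    """Return the fraction of rows in each group whose label equals `positive`."""
    totals, positives = Counter(), Counter()
    for row in rows:
        totals[row[group_column]] += 1
        if row[label_column] == positive:
            positives[row[group_column]] += 1
    return {group: positives[group] / totals[group] for group in totals}


# Made-up rows for illustration only.
rows = [
    {"gender": "F", "readmission_flag": "1"},
    {"gender": "F", "readmission_flag": "0"},
    {"gender": "M", "readmission_flag": "1"},
    {"gender": "M", "readmission_flag": "1"},
]
print(group_shares(rows, "gender"))
print(positive_rate_by_group(rows, "gender", "readmission_flag"))
```

Large gaps between groups are a prompt to investigate how the data was collected and labeled, and the results are worth recording alongside the dataset's governance documentation.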

The GDPR + EU AI Act overlap:

EU AI Act Recital 69 explicitly references GDPR data minimization for AI training data. GDPR Article 5(1)(c) data minimization and Article 25 privacy by design apply alongside EU AI Act Article 10. Both require that training data be limited to what is necessary — using a 10-million-row customer dataset when 100,000 rows would suffice is a minimization problem under both frameworks.

EU AI Act Article 10 + GDPR Article 5 minimization checklist
for AI training datasets:

Before using a CSV as training data:
□ Dataset is relevant to the model's intended purpose (Art 10)
□ Dataset is representative — no systematic bias in included populations (Art 10)
□ Personal data limited to what's necessary for the training purpose (GDPR Art 5)
□ Sensitive categories (health, ethnicity, etc.) handled with explicit basis (GDPR Art 9)
□ DPIA completed if model makes significant automated decisions (GDPR Art 35)
□ Technical documentation records data sources and governance (EU AI Act Art 11)
□ Data lineage documented: what data, from where, processed how (EU AI Act Art 10(2), Annex IV)
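
The minimization items in this checklist can be enforced mechanically with a column allow-list and a row cap, as in the sketch below. The column names follow the churn example earlier in this post; the allow-list, the cap, and the seed are assumptions for illustration. A plain random sample can under-represent small groups, so the representativeness item still needs checking after sampling.

```python
import random


def minimize(rows, allowed_columns, max_rows, seed=0):
    """Keep only allow-listed columns and at most `max_rows` randomly sampled rows."""
    rows = list(rows)
    if len(rows) > max_rows:
        # Seeded generator so the same sample can be reproduced for audits.
        rows = random.Random(seed).sample(rows, max_rows)
    return [{col: row[col] for col in allowed_columns} for row in rows]


# Made-up rows shaped like the churn export in this post.
full_export = [
    {"customer_id": f"C-{i}", "age": str(20 + i), "location": "Berlin",
     "purchase_history": str(i * 3), "churn_risk_label": str(i % 2),
     "support_tickets": str(i % 4)}
    for i in range(10)
]

# Hypothetical allow-list: only the fields the model is assumed to use.
ALLOWED = ["age", "purchase_history", "support_tickets", "churn_risk_label"]

training_rows = minimize(full_export, ALLOWED, max_rows=4)
print(len(training_rows), sorted(training_rows[0]))
```

An allow-list fails closed: a new column added to the export upstream is excluded from training until someone decides it is necessary.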

What this means for your AI data workflow: If your organization deploys AI in hiring, credit, healthcare, or education affecting EU individuals, August 2, 2026 is your compliance deadline for Annex III requirements. The training data documentation requirements need to start now — not when enforcement begins.


AI Data Lifecycle: Where Regulation Applies at Each Stage

Compliance starts before training and continues through deployment. This is the full map — most teams only address the training step.

AI DATA LIFECYCLE

STAGE 1: Collection
→ CSV export of customer / employee / patient data
→ GDPR: lawful basis required for collection purpose
→ HIPAA: PHI triggers BAA if cloud-stored
→ EU AI Act: data quality obligations begin here (Art 10)

STAGE 2: Preprocessing
→ Field selection, cleaning, normalization, masking
→ GDPR Art 5(1)(c): minimize to fields model actually uses
→ GDPR Art 25: privacy-by-design at preprocessing stage
→ Action: mask locally before any upload to training platform

STAGE 3: Training
→ CSV data reaches ML platform
→ GDPR: platform is a processor (DPA required)
→ HIPAA: platform may be a Business Associate (BAA required if PHI)
→ EU AI Act: training data documentation required (Art 10, 11)
→ Cross-border: SCCs + TIA if US-hosted platform

STAGE 4: Model Deployment
→ Model makes predictions or decisions
→ GDPR Art 22: automated decisions affecting individuals require basis + opt-out
→ EU AI Act: high-risk systems require human oversight, conformity assessment
→ Transparency: Art 50 disclosure obligations (active August 2026)

STAGE 5: Monitoring
→ Ongoing model performance and bias tracking
→ EU AI Act: post-market monitoring required for high-risk systems
→ GDPR: purpose limitation — model must not drift beyond original use
→ Action: document bias checks and governance reviews

Most data teams are compliant at Stage 2 (preprocessing) and non-compliant at Stages 3 and 4. Fix Stages 3–4 before August 2026 if Annex III applies.


Step 1: Classify the AI system
Is it making automated decisions in an Annex III category affecting EU individuals? If yes, high-risk obligations apply from August 2026. Document this classification.

Step 2: Identify the training data and its personal data content
List every CSV file in the training pipeline. Identify which contain personal data, which contain PHI, and which contain special category data under GDPR Article 9.

Step 3: Confirm lawful basis for each dataset
For each CSV with personal data: document the lawful basis. Run a DPIA if the model makes significant automated decisions. For PHI: confirm BAA with AI vendor before any transfer.

Step 4: Minimize and anonymize before training
Remove or mask fields that aren't necessary for the training objective. Apply Safe Harbor de-identification to PHI where feasible. Pseudonymize customer identifiers. A training dataset should contain the minimum personal data needed to achieve the model's objective, not the maximum available.

Step 5: Confirm DPA and transfer mechanism with the AI platform
The ML training platform is a data processor if it processes personal data on your behalf. DPA required. For EU data on US platforms: SCCs + Transfer Impact Assessment.

Step 6: Document data lineage
EU AI Act Article 10 requires documentation of training data sources and governance. Maintain a record of: which datasets were used, what personal data they contained, what governance was applied, and what bias examination was conducted.
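
The record described in Step 6 could be kept as one small JSON document per dataset version. The sketch below is one possible shape, not a prescribed format: the field names, the example file name, and the example entries are all hypothetical. The SHA-256 digest ties the record to the exact bytes that were used for training.

```python
import hashlib
import json
from datetime import datetime, timezone


def lineage_record(name, data, columns, personal_data_columns,
                   governance_steps, bias_checks):
    """Build a JSON-serializable record for one version of a training dataset."""
    return {
        "dataset": name,
        "sha256": hashlib.sha256(data).hexdigest(),  # fingerprint of the exact file bytes
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "columns": list(columns),
        "personal_data_columns": list(personal_data_columns),
        "governance_applied": list(governance_steps),
        "bias_examination": list(bias_checks),
    }


# Illustrative in-memory CSV; a real pipeline would read the file's bytes from disk.
csv_bytes = b"customer_ref,age,churn_risk_label\n9f2c,34,0\n"

record = lineage_record(
    name="churn_training_v1.csv",  # hypothetical file name
    data=csv_bytes,
    columns=["customer_ref", "age", "churn_risk_label"],
    personal_data_columns=["customer_ref", "age"],
    governance_steps=["customer_id pseudonymized", "columns reduced to allow-list"],
    bias_checks=["group shares reviewed by age band"],
)
print(json.dumps(record, indent=2))
```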

Many ML platforms and LLM APIs upload your training CSV to remote servers — often in the US — for processing. For files containing personal data, this is simultaneously a GDPR cross-border transfer, a potential HIPAA BAA event, and (for Annex III systems) an EU AI Act data governance event. SplitForge processes masking and anonymization locally before any file leaves the machine. The training platform only receives the pre-processed, minimized dataset.

For a complete overview of privacy frameworks, see our privacy-first data processing guide. For anonymization techniques to reduce training data regulatory scope, see our GDPR anonymization guide.


Anonymization as the Path Out

Genuinely anonymous training data (meeting the GDPR Recital 26 standard) falls outside GDPR scope entirely. If a training dataset can be made genuinely anonymous before use, GDPR no longer applies to the training step itself. The anonymization step is still processing of personal data, so the obligations are reduced rather than removed.

For tabular AI training data, this typically means:

  • Aggregating individual records into group statistics rather than training on individual rows
  • Applying k-anonymity across quasi-identifiers (age, location, job title combinations)
  • Removing all direct identifiers and limiting quasi-identifier precision
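
To make the k-anonymity point concrete: a dataset is k-anonymous over a set of quasi-identifiers when every combination of their values appears in at least k rows. The sketch below measures k and suppresses rows in groups smaller than a target. The column names and rows are made up for illustration, and suppression is only one of several ways to reach a target k (generalizing values into wider bands is another).

```python
from collections import Counter


def k_anonymity(rows, quasi_identifiers):
    """Return the size of the smallest group sharing the same quasi-identifier values."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values()) if groups else 0


def suppress_small_groups(rows, quasi_identifiers, k):
    """Drop rows whose quasi-identifier combination occurs fewer than k times."""
    rows = list(rows)
    sizes = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return [row for row in rows
            if sizes[tuple(row[q] for q in quasi_identifiers)] >= k]


# Made-up rows: the Paris record is unique on (age_band, city).
rows = [
    {"age_band": "30-39", "city": "Berlin", "churn": "0"},
    {"age_band": "30-39", "city": "Berlin", "churn": "1"},
    {"age_band": "40-49", "city": "Paris", "churn": "0"},
]
quasi = ["age_band", "city"]

print(k_anonymity(rows, quasi))
kept = suppress_small_groups(rows, quasi, k=2)
print(len(kept), k_anonymity(kept, quasi))
```

Reaching a given k does not by itself establish that the data is anonymous under Recital 26. It is one measurable input to that assessment.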

The practical limit: Many ML models require individual-level training data to learn meaningful patterns. A churn prediction model trained on aggregated data loses most of its signal. In these cases, pseudonymization (not anonymization) is the realistic option — reducing breach risk while GDPR obligations remain.

Synthetic data as an alternative: Generating synthetic data that preserves the statistical properties of real customer data — without containing real individuals' records — is a growing approach. Properly generated synthetic data may fall outside GDPR scope if it genuinely cannot be linked to real individuals. See our test data generation guide for the practical implementation.


Operator Rules: AI Data Privacy

Short. Non-negotiable. Reference before any personal data enters an AI training pipeline.

  • Training data is a processing activity — GDPR applies the moment you start
  • PHI sent to any AI vendor without a BAA is a potential HIPAA violation — regardless of the vendor's privacy claims
  • EU AI Act high-risk systems need documented data governance — build the records now, not at enforcement time
  • Minimize training data to what's necessary — 100K representative rows is almost always better than 10M noisy rows with full PII
  • DPIA first if the model makes automated decisions affecting individuals — this is not optional
  • The AI platform's terms of service are not a DPA — request one before uploading
  • August 2, 2026 is the Annex III deadline — if you're deploying high-risk AI in the EU, compliance needs to start now

Disclaimer: This post is for informational purposes only and does not constitute legal advice. AI system classification, training data obligations, and regulatory applicability depend on your specific systems, data types, and jurisdiction. Consult qualified legal counsel before drawing compliance conclusions.


FAQ

Yes. GDPR regulates the processing of personal data — including using it as training input. The lawful basis requirement and data minimization principles apply to the training activity itself, not just to model outputs. If personal data was processed to train the model, GDPR obligations applied to that processing regardless of whether the trained model's outputs contain personal data.

Using customer data for AI training under legitimate interest is possible, but the processing must be disclosed in your privacy notice (Articles 13/14 GDPR). Customers must be informed of the purposes of processing including AI/ML use. A legitimate interest basis that isn't disclosed to data subjects is not valid. Additionally, customers retain the right to object under Article 21.

No. A vendor being "GDPR compliant" as a description of their internal practices is different from having a signed Data Processing Agreement in place with you. Article 28 requires a contract between controller and processor that covers the specific obligations listed in GDPR. Request the DPA, read the scope (which processing activities it covers), and confirm it covers your AI training use case before uploading data.

The official deadline is August 2, 2026. The European Commission proposed a "Digital Omnibus" package in late 2025 that could potentially delay Annex III obligations for some systems. However, as of March 2026, this proposal has not been enacted. Prudent compliance planning treats August 2, 2026 as the binding deadline — do not assume the extension will materialize.

Yes. Employee data is personal data. Processing employee data for AI training is a processing activity requiring a lawful basis. The most common basis is legitimate interest — but this requires a balancing test and may be more difficult to justify for sensitive processing (performance evaluation, behavior monitoring) where employees have limited ability to object given the power imbalance. A DPIA is likely required for AI systems that make significant automated assessments of employees.


Process AI Training Data Without Expanding Your Compliance Footprint

Mask personal identifiers in training datasets locally before any upload to an ML platform
Apply Safe Harbor de-identification to PHI locally — eliminate the BAA requirement for the training step
Files never leave your browser during masking — the training platform only receives the pre-processed dataset
Handle million-row training datasets without uploading raw personal data to a cloud ML environment
