
PII Masking for CSV Files: 5 Techniques That Work Without Installation

March 16, 2026
By SplitForge Team

Quick Answer

What are the main PII masking techniques for CSV files?

Five techniques cover virtually every use case: redaction (remove the value entirely), masking (replace characters with asterisks, preserving structure), pseudonymization (replace real values with consistent fake values), hashing (replace values with one-way cryptographic digests), and tokenization (replace values with reversible tokens linked to a key table). Each technique produces a different result for re-identification risk, data utility, and GDPR status.

Critical distinction: All five of these techniques typically produce pseudonymized data — still regulated under GDPR. Only genuine anonymization, which meets the Recital 26 standard, removes GDPR obligations entirely. See our pseudonymization vs. anonymization guide for the legal details.


TL;DR: PII masking in CSV files doesn't require enterprise software. Redaction, masking, pseudonymization, hashing, and tokenization each serve different purposes and produce different levels of re-identification protection. The right technique depends on whether you need to re-link data later and who will receive the processed file. SplitForge Data Masking applies all five techniques locally in your browser — the file never leaves your device.


Your marketing team needs to share customer data with an external analytics firm. Your HR department needs to send payroll records to a benefits administrator. Your development team needs realistic test data without using production records.

In every case, the same problem: how do you reduce the personal data in a CSV to the minimum necessary for the task, without uploading your files to yet another cloud service and without spending a week configuring enterprise software?

Enterprise PII masking tools — IRI FieldShield, SAS Studio Data Masking, Strac — are powerful. They are also designed for organizations with dedicated data engineering teams, IT procurement budgets, and weeks to implement. They require either a local installation or a cloud upload of the exact data you are trying to protect.

There is a simpler path for the CSV use case. Five techniques, applied in minutes, cover the practical range of PII masking needs. Each technique was tested in March 2026 against synthetic CSV files containing names, email addresses, dates of birth, national ID numbers, and financial account data, using the SplitForge Data Masking tool.



This guide is for: Data analysts, compliance teams, marketing operations, and HR teams who need to mask PII in CSV files before sharing, testing, or external processing — without enterprise tooling.


The PII Masking Decision Tree

Use this tree to identify the right technique before you start. The key questions are: do you need the data to be usable after masking, and do you need to re-link it to the original later?

Five Techniques at a Glance

Original CSV Data
       │
       ├─ Redaction ──────────► Field removed entirely
       │                         Value: [REDACTED] or null
       │                         Re-linkable: No
       │                         Use: Recipient has no need for this field
       │
       ├─ Character Masking ──► Partial value visible
       │                         Value: j***.***@c******.com
       │                         Re-linkable: No
       │                         Use: Format visible, value obscured
       │
       ├─ Pseudonymization ──► Consistent fake value
       │                         Value: [email protected]
       │                         Re-linkable: Yes (key table)
       │                         Use: Analytics, ML training, test data
       │
       ├─ Hashing ────────────► One-way digest
       │                         Value: 3b4c5d8e... (64-char hex)
       │                         Re-linkable: No (unless brute-forced)
       │                         Use: Cross-dataset matching, deduplication
       │
       └─ Tokenization ───────► Vault-based token
                                Value: TKN-8829-4471-9920
                                Re-linkable: Yes (vault access)
                                Use: PCI DSS scope reduction, authorized retrieval

All five produce PSEUDONYMIZED data under GDPR Article 4(5).
GDPR obligations apply in full to all outputs above.
True anonymization (GDPR Recital 26) requires a separate assessment.

Do you need to re-link masked data back to individuals later?
│
├─ YES (fraud investigation, SAR response, audit trail)
│   │
│   ├─ Must the masked version look realistic?
│   │   ├─ YES → Pseudonymization (consistent fake values)
│   │   └─ NO  → Tokenization (reversible token + key table)
│   │
│   └─ Key table can be kept securely? → Both above require key management
│
└─ NO (one-way operation, test data, external sharing)
    │
    ├─ Recipient needs to read/parse the field values?
    │   ├─ YES → Character Masking (preserves format, partial values visible)
    │   └─ NO  → Redaction (remove field entirely)
    │
    └─ Recipient needs to match/deduplicate across datasets?
        └─ YES → Hashing (consistent digest, no original value visible)
                  ⚠️ Only use for high-entropy values (not names, common emails)

| Technique | Re-linkable? | Format preserved? | GDPR status | Best for |
| --- | --- | --- | --- | --- |
| Redaction | No | No | Pseudonymized | Fields not needed by recipient |
| Character Masking | No | Yes (partial) | Pseudonymized | Human-readable partial values |
| Pseudonymization | Yes (key table) | Yes (realistic) | Pseudonymized | Test data, consistent replacements |
| Hashing | No (unless brute-forced) | No (digest) | Pseudonymized | Cross-dataset matching, deduplication |
| Tokenization | Yes (token vault) | No (token) | Pseudonymized | PCI DSS cardholder data, reversible |
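
The decision tree above can also be written as a small function, which is handy when the choice has to be made per column in a script. This is a minimal sketch: the option names are hypothetical, and checking the matching question before the read/parse question is an assumption, since the tree lists the two as siblings.

```javascript
// Hypothetical helper that encodes the decision tree above.
// All option names are illustrative; they are not part of any tool's API.
function chooseTechnique({
  needsRelinking = false,       // must someone map the output back to individuals later?
  mustLookRealistic = false,    // should masked values resemble real ones?
  needsMatching = false,        // will the recipient match or deduplicate across datasets?
  recipientReadsValues = false, // does the recipient need to read or parse the field?
} = {}) {
  if (needsRelinking) {
    // Both options depend on a securely stored key table or vault.
    return mustLookRealistic ? 'pseudonymization' : 'tokenization';
  }
  if (needsMatching) {
    // Only suitable for high-entropy values (not names or common emails).
    return 'hashing';
  }
  return recipientReadsValues ? 'character masking' : 'redaction';
}

console.log(chooseTechnique({ needsRelinking: true, mustLookRealistic: true })); // pseudonymization
console.log(chooseTechnique({ needsMatching: true }));                           // hashing
console.log(chooseTechnique({}));                                                // redaction
```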

Technique 1: Redaction

Redaction removes the field value entirely, replacing it with a null value, empty string, or placeholder such as [REDACTED].

When to use it: When the recipient has no legitimate need for the specific field. If you are sharing customer purchase behavior with an analytics firm, customer names and email addresses are almost certainly unnecessary for the analysis. Removing them entirely eliminates the exposure rather than managing it.

What it looks like:

| Before | After |
| --- | --- |
| Name: Maria Santos | Name: [REDACTED] |
| Email: [email protected] | Email: (empty) |
| National ID: 123-45-6789 | National ID: [REDACTED] |

GDPR status: The redacted record may still be pseudonymized data if the remaining fields allow re-identification. Removing one identifier from a 20-column dataset does not automatically anonymize the record. Apply data minimization (GDPR Article 5(1)(c)) — strip all columns the recipient does not need, not just the obvious PII columns.

What it does not do: Redaction does not prevent re-identification through quasi-identifiers. A record with name and email redacted but containing date of birth, postcode, employer, and device type may remain individually identifiable. Assess the remaining fields for re-identification risk before treating the output as safe to share.
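
As a rough illustration of the two options described above (replacing values with a placeholder, or stripping columns the recipient does not need), here is a minimal sketch. It assumes the CSV has already been parsed into one object per row; the function names and sample columns are hypothetical, and CSV parsing itself is out of scope.

```javascript
// Replace the values of the named columns with a placeholder.
// Column names and the placeholder text are illustrative assumptions.
function redactColumns(rows, columnsToRedact, placeholder = '[REDACTED]') {
  const targets = new Set(columnsToRedact);
  return rows.map(row => {
    const out = {};
    for (const [column, value] of Object.entries(row)) {
      out[column] = targets.has(column) ? placeholder : value;
    }
    return out;
  });
}

// Data minimization: keep only the columns the recipient actually needs.
function keepColumns(rows, columnsToKeep) {
  return rows.map(row => {
    const out = {};
    for (const column of columnsToKeep) {
      if (column in row) out[column] = row[column];
    }
    return out;
  });
}

const exampleRows = [
  { name: 'Maria Santos', national_id: '123-45-6789', revenue_band: 'Mid' },
];
console.log(redactColumns(exampleRows, ['name', 'national_id']));
console.log(keepColumns(exampleRows, ['revenue_band']));
```

Dropping a column with keepColumns is usually the more protective choice, because a placeholder still tells the recipient that the field exists.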


Technique 2: Character Masking

Character masking replaces parts of a field value with neutral characters (typically asterisks or Xs) while preserving the surrounding structure. The goal is to render the value unreadable while keeping the field format recognizable.

When to use it: When the recipient needs to understand the data format without seeing the actual values. Customer support teams who need to confirm a customer's email domain without seeing the full address. Audit logs that need to show a transaction occurred without exposing the account number.

What it looks like:

| Before | After (format preserved) |
| --- | --- |
| Email: [email protected] | Email: j***.@c***.com |
| Phone: +44 7911 123456 | Phone: +44 **** ***456 |
| Credit card: 4929 1234 5678 9012 | Credit card: 4929 **** **** 9012 |
| Date of birth: 1985-03-12 | Date of birth: 1985-**-** |

GDPR status: Always pseudonymization. The original value exists in the source system. The masked value often reveals enough structure to narrow re-identification, particularly for low-population data. A masked email that shows the first initial and full domain may be sufficient to identify an individual within a small organization.

What it does not do: Character masking does not prevent re-identification by someone who knows the partial values. It is appropriate for reducing casual exposure — not for sharing with untrusted external parties.
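
To make the idea concrete, the sketch below masks digits while leaving separators in place, and masks an email while keeping the first character of each part. The function names and the exact output shapes are assumptions for illustration; a real masking tool may choose different rules for how much to reveal.

```javascript
// Mask every digit except the first `head` and last `tail`, leaving spaces,
// dashes, and other separators untouched so the format stays recognizable.
function maskDigits(value, head = 4, tail = 4) {
  const totalDigits = (value.match(/\d/g) || []).length;
  let seen = 0;
  return value.replace(/\d/g, digit => {
    seen += 1;
    return seen <= head || seen > totalDigits - tail ? digit : '*';
  });
}

// Keep the first character of the local part and of the domain name,
// plus the top-level domain; mask everything else.
function maskEmail(email) {
  const at = email.lastIndexOf('@');
  if (at < 1) return '*'.repeat(email.length); // not email-shaped: mask it all
  const local = email.slice(0, at);
  const domain = email.slice(at + 1);
  const dot = domain.lastIndexOf('.');
  const name = dot > 0 ? domain.slice(0, dot) : domain;
  const tld = dot > 0 ? domain.slice(dot) : '';
  const keepFirst = s => (s.length ? s[0] + '*'.repeat(s.length - 1) : '');
  return `${keepFirst(local)}@${keepFirst(name)}${tld}`;
}

console.log(maskDigits('4929 1234 5678 9012'));   // 4929 **** **** 9012
console.log(maskDigits('+44 7911 123456', 2, 3)); // +44 **** ***456
console.log(maskDigits('1985-03-12', 4, 0));      // 1985-**-**
console.log(maskEmail('jane.doe@example.com'));   // j*******@e******.com
```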


Technique 3: Pseudonymization (Value Substitution)

Pseudonymization, as a specific technique, replaces real field values with consistent fictional alternatives. The same real value always maps to the same pseudonym. Maria Santos always becomes User_4821. Account number 123456 always becomes ACCT_8820. This consistency is what distinguishes pseudonymization from random masking.

When to use it: When the recipient needs data that behaves like real data — for analytics that join on customer IDs, for ML training that requires consistent entity representations, or for development environments that need realistic but non-identifiable datasets.

What it looks like:

| Before | After (consistent substitution) |
| --- | --- |
| CustomerID: C-10045 | CustomerID: USR-8829 |
| Name: Maria Santos | Name: Alex Rivera |
| Email: [email protected] | Email: [email protected] |
| Phone: +44 7911 123456 | Phone: +44 7700 900142 |

The substitution is deterministic: the same input always produces the same output. The relationship between Customer C-10045 and their purchase history is preserved in the pseudonymized dataset — but the identity is replaced.

GDPR status: Always pseudonymized data (GDPR Article 4(5)). The key table linking real identities to pseudonyms must be stored securely and separately. The output CSV is personal data. A Data Processing Agreement is still required if you share it with an external party. All data subject rights apply.

What it does not do: Pseudonymization does not prevent re-identification if the key table is compromised or if the pseudonymized fields can be cross-referenced against external datasets to recover the original identity.
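
The consistency requirement described above comes down to a lookup table: the first time a real value is seen it gets a new pseudonym, and every later occurrence reuses it. The sketch below is one simple way to do that. The createPseudonymizer name and the sequential USR-0001 format are assumptions; sequential IDs reveal the order in which values first appeared, so a random format may be preferable in practice.

```javascript
// Build a pseudonymizer for one column. The Map is the key table: store it
// separately from the output file and protect it like the original data.
function createPseudonymizer(prefix) {
  const keyTable = new Map(); // real value -> pseudonym
  let counter = 0;

  function pseudonymize(realValue) {
    if (!keyTable.has(realValue)) {
      counter += 1;
      keyTable.set(realValue, `${prefix}-${String(counter).padStart(4, '0')}`);
    }
    return keyTable.get(realValue);
  }

  return { pseudonymize, keyTable };
}

const customers = createPseudonymizer('USR');
console.log(customers.pseudonymize('C-10045')); // USR-0001
console.log(customers.pseudonymize('C-20071')); // USR-0002
console.log(customers.pseudonymize('C-10045')); // USR-0001 again (consistent)
```

Whoever holds keyTable can reverse the mapping, which is why the output remains pseudonymized rather than anonymized.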


Technique 4: Hashing

Hashing applies a one-way cryptographic function to the field value, producing a fixed-length digest. The same input always produces the same digest (deterministic), but you cannot recover the original value from the digest alone.

When to use it: When you need consistent identifiers for cross-dataset matching or deduplication without retaining the original values. Comparing two customer lists to find overlap without exposing customer emails to the party performing the comparison. Generating unique identifiers for analytics without storing the underlying PII.

What it looks like:

| Before | After (SHA-256) |
| --- | --- |
| Email: [email protected] | Email: 3b4c5d... (64-char hex string) |
| National ID: 123-45-6789 | National ID: a7f2e1... (64-char hex string) |

Basic implementation (JavaScript, for reference):

// Hashing an email address with a salt using SHA-256
// Salt should be a secret, random string stored securely — not hardcoded
async function hashEmail(email, salt) {
  const input = email.toLowerCase().trim() + salt;
  const encoded = new TextEncoder().encode(input);
  const hashBuffer = await crypto.subtle.digest('SHA-256', encoded);
  const hashArray = Array.from(new Uint8Array(hashBuffer));
  return hashArray.map(b => b.toString(16).padStart(2, '0')).join('');
}

// Usage:
// hashEmail('[email protected]', 'your-secret-salt-here')
// → '3b4c5d8e9f...' (consistent digest for same input)
// The same email always produces the same hash (deterministic)
// Different salts produce entirely different hashes (preventing rainbow tables)

This runs entirely in the browser, with no server required. The crypto.subtle API is built into all modern browsers and needs no library, although browsers expose it only in secure contexts (HTTPS or localhost). Note that even with salting, hashed email addresses remain pseudonymized data under GDPR: anyone who holds the salt can hash candidate addresses and compare the results, and the original values still exist in the source system.

GDPR status: Still pseudonymized data in most practical cases. For low-entropy inputs (names, common email formats, national ID number patterns), unsalted hashes can be reversed using precomputed rainbow tables or brute force. Opinion 05/2014 on Anonymisation Techniques, issued by the Article 29 Working Party (the EDPB's predecessor), treats hashing as a pseudonymisation technique and notes that a hash can be reversed by trying candidate inputs when the range of possible values is known.

Salted hashing (adding a secret random value to the input before hashing) significantly improves security but requires careful key management. Even salted hashes do not meet the GDPR Recital 26 anonymization standard if the original values can be guessed.

Why this matters in practice: the AOL case. In 2006, AOL released about 20 million "anonymized" search queries with usernames replaced by numeric IDs. Reporters at The New York Times identified one of the users from her search patterns alone, without any cryptographic reversal. Hashed identifiers provide stronger protection than numeric IDs, but the underlying principle applies: when the original value is guessable from context or external data, the hash provides weaker protection than its cryptographic strength suggests.

What it does not do: Hashing is not anonymization. For high-entropy values (random UUIDs, cryptographic keys), hashing provides strong pseudonymization. For common personal data values, assume the hash is reversible and maintain full GDPR compliance.
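
The "assume the hash is reversible" advice is easy to check for yourself. The sketch below brute-forces an unsalted SHA-256 digest of a deliberately low-entropy value, a hypothetical 4-digit code, by hashing all 10,000 candidates. It uses Node.js's built-in crypto module so the example runs synchronously; the helper names are illustrative.

```javascript
const { createHash } = require('crypto');

// Unsalted SHA-256 as a 64-character hex string.
function sha256Hex(text) {
  return createHash('sha256').update(text, 'utf8').digest('hex');
}

// Hash every candidate and return the one whose digest matches, or null.
function reverseByGuessing(targetDigest, candidates) {
  for (const candidate of candidates) {
    if (sha256Hex(candidate) === targetDigest) return candidate;
  }
  return null;
}

// Hypothetical leaked digest of a 4-digit code.
const leakedDigest = sha256Hex('4821');
const allCodes = Array.from({ length: 10000 }, (_, i) => String(i).padStart(4, '0'));
console.log(reverseByGuessing(leakedDigest, allCodes)); // 4821
```

The same loop works against any input space small enough to enumerate, and a salt only helps for as long as it stays secret.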


Technique 5: Tokenization

Tokenization replaces field values with randomly generated tokens that have no mathematical relationship to the original values. A centralized token vault stores the mapping between tokens and original values. The token can be exchanged for the original only by parties with access to the vault.

When to use it: When you need to reduce PCI DSS cardholder data scope (replacing card numbers with tokens in systems that process but do not need the real numbers), or when you need a reversible de-identification method with strict access control over who can recover original values.

What it looks like:

| Before | After |
| --- | --- |
| Card: 4929 1234 5678 9012 | Card: TKN-8829-4471-9920 |
| Account: ACC-10045 | Account: TKN-2254-8819-0011 |

GDPR status: Always pseudonymized data. The token vault holds the mapping to original values. Whoever controls vault access can recover the originals. GDPR applies in full to both the tokenized data and the vault.

Key difference from hashing: Tokenization is explicitly reversible by design — that is its purpose. Hashing is intended to be one-way. Use tokenization when re-linkability is a feature (PCI DSS compliance, authorized retrieval workflows). Use hashing when you want one-way consistency without a key management dependency.
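
A token vault can be pictured as two lookup tables guarded by access control. The in-memory sketch below shows the mechanics only: the TokenVault class is hypothetical, it reuses the same token for a repeated value (a design assumption), and a production vault would add persistence, encryption at rest, and authorization checks around detokenize. It uses Node.js's crypto.randomInt; in a browser, crypto.getRandomValues would play the same role.

```javascript
const { randomInt } = require('crypto');

class TokenVault {
  constructor() {
    this.tokenByValue = new Map(); // original value -> token
    this.valueByToken = new Map(); // token -> original value
  }

  // Return the existing token for a value, or mint a new random one.
  tokenize(value) {
    const existing = this.tokenByValue.get(value);
    if (existing !== undefined) return existing;
    let token;
    do {
      const groups = Array.from({ length: 3 }, () =>
        String(randomInt(0, 10000)).padStart(4, '0'));
      token = `TKN-${groups.join('-')}`;
    } while (this.valueByToken.has(token)); // retry on the rare collision
    this.tokenByValue.set(value, token);
    this.valueByToken.set(token, value);
    return token;
  }

  // Only callers with vault access should ever reach this method.
  detokenize(token) {
    if (!this.valueByToken.has(token)) throw new Error('Unknown token');
    return this.valueByToken.get(token);
  }
}

const vault = new TokenVault();
const cardToken = vault.tokenize('4929 1234 5678 9012');
console.log(cardToken);                   // random, shaped like TKN-dddd-dddd-dddd
console.log(vault.detokenize(cardToken)); // 4929 1234 5678 9012
```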


GDPR Status of Each Technique

This is the single most important thing to understand about PII masking: every technique in this guide produces pseudonymized data, not anonymized data, in almost all practical scenarios. GDPR applies to pseudonymized data in full.

| Technique | Typically pseudonymized? | Meets Recital 26 anonymization? | GDPR obligations |
| --- | --- | --- | --- |
| Redaction | Yes: remaining fields may allow re-identification | Only if ALL quasi-identifiers are also removed | Full GDPR obligations apply |
| Character masking | Yes: original exists in source; partial values remain | No | Full GDPR obligations apply |
| Value substitution | Yes: key table allows re-identification | No | Full GDPR obligations apply |
| Hashing (unsalted) | Yes: reversible for low-entropy inputs | No | Full GDPR obligations apply |
| Hashing (salted, high-entropy) | Reduced re-identification risk but typically still pseudonymized | Contextual: requires full quasi-identifier analysis | Full GDPR obligations apply unless Recital 26 standard confirmed |
| Tokenization | Yes: vault enables re-identification | No | Full GDPR obligations apply |

Read our pseudonymization vs. anonymization guide for the legal framework governing when data genuinely falls outside GDPR scope.


End-to-End Example: Selecting and Applying a Technique

This walkthrough covers the complete process from identifying a field to producing the masked output — the decision a data engineer or analyst makes in practice.

Scenario: You have a customer CSV with these columns and are preparing it for handoff to an external analytics firm:

customer_id, email, full_name, phone, dob, postcode, revenue_band, churn_risk_score

Step 1 — Classify each field:

| Field | Type | Contains direct identifier? | Recipient needs it? |
| --- | --- | --- | --- |
| customer_id | Internal key | No (opaque ID) | Yes, for record linkage |
| email | Direct identifier | Yes | No |
| full_name | Direct identifier | Yes | No |
| phone | Direct identifier | Yes | No |
| dob | Quasi-identifier | No, but narrows population | No |
| postcode | Quasi-identifier | No, but narrows geography | Sector only (first 4 chars) |
| revenue_band | Categorical | No | Yes |
| churn_risk_score | Analytical output | No | Yes |

Step 2 — Apply data minimization first: Drop email, full_name, phone entirely — the recipient has no stated need for these fields. Stripping them is more protective than masking.

Step 3 — Apply technique to remaining sensitive fields:

| Field | Decision | Technique | Output example |
| --- | --- | --- | --- |
| customer_id | Replace with consistent pseudonym | Salted hash | f7e2a1b3... (consistent per customer) |
| dob | Not needed, so strip | Redaction | [REMOVED] |
| postcode | Needed at sector level | Truncation (character masking) | SW1A 1AA → SW1A |
| revenue_band | Keep as-is | None | Mid |
| churn_risk_score | Keep as-is | None | 0.72 |

Step 4 — Output:

customer_hash, postcode_sector, revenue_band, churn_risk_score
f7e2a1b3..., SW1A, Mid, 0.72
a9c2f1d4..., EC1A, High, 0.41
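
Steps 2 to 4 above can be sketched as a single row transformation. This is an illustration under stated assumptions rather than the tool's implementation: rows are already parsed into objects, the salted hash mirrors the article's hashEmail construction (value plus salt, SHA-256) ported to Node.js's synchronous createHash, and the postcode is truncated at the space, which in UK postcode terms is strictly the outward code. The function names and the sample email are hypothetical.

```javascript
const { createHash } = require('crypto');

// Salted SHA-256 of a value; the salt must stay out of the shared file.
function saltedHash(value, salt) {
  return createHash('sha256').update(String(value) + salt, 'utf8').digest('hex');
}

// Drop direct identifiers and dob, hash the ID, truncate the postcode.
// Fields not copied into the returned object are simply left out.
function prepareForHandoff(rows, salt) {
  return rows.map(row => ({
    customer_hash: saltedHash(row.customer_id, salt),
    // Keep only the part before the space, e.g. "SW1A 1AA" -> "SW1A".
    postcode_sector: String(row.postcode).trim().split(/\s+/)[0],
    revenue_band: row.revenue_band,
    churn_risk_score: row.churn_risk_score,
  }));
}

const sampleRows = [{
  customer_id: 'C-10045', email: 'maria@example.com', full_name: 'Maria Santos',
  phone: '+44 7911 123456', dob: '1985-03-12', postcode: 'SW1A 1AA',
  revenue_band: 'Mid', churn_risk_score: 0.72,
}];
console.log(prepareForHandoff(sampleRows, 'replace-with-a-secret-salt'));
```

Building the output from an explicit list of fields means a new PII column added to the source later is excluded by default.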

What this achieves: No direct identifiers remain. The customer_hash allows the analytics firm to track the same customer across rows without knowing their identity — this is pseudonymization under GDPR Article 4(5), not anonymization. A DPA with the recipient is still required. The firm cannot re-identify customers from this output alone, but re-identification risk from quasi-identifier combinations (postcode_sector + revenue_band) should be assessed if the recipient has access to external datasets.
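
One simple way to start the quasi-identifier assessment mentioned above is to count how many rows share each combination of values: a combination that appears only once points at a single customer. The sketch below computes the smallest such group, which is the idea behind k-anonymity. The function names are hypothetical, and a small count is a warning sign rather than a full risk assessment.

```javascript
// Count how many rows share each combination of quasi-identifier values.
function groupSizes(rows, quasiIdentifiers) {
  const sizes = new Map();
  for (const row of rows) {
    const key = JSON.stringify(quasiIdentifiers.map(column => row[column]));
    sizes.set(key, (sizes.get(key) || 0) + 1);
  }
  return sizes;
}

// Size of the smallest group; 1 means at least one row is unique.
function smallestGroup(rows, quasiIdentifiers) {
  let smallest = 0;
  for (const size of groupSizes(rows, quasiIdentifiers).values()) {
    if (smallest === 0 || size < smallest) smallest = size;
  }
  return smallest;
}

const maskedRows = [
  { postcode_sector: 'SW1A', revenue_band: 'Mid' },
  { postcode_sector: 'SW1A', revenue_band: 'Mid' },
  { postcode_sector: 'EC1A', revenue_band: 'High' },
];
console.log(smallestGroup(maskedRows, ['postcode_sector', 'revenue_band'])); // 1
```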

Choosing the Right Technique

Before sharing data with an external party: Apply pseudonymization (value substitution or redaction) to all direct identifiers. Strip all columns the recipient does not need. Assess remaining quasi-identifiers for re-identification risk. Sign a DPA with the recipient before transmission. Process locally if possible to avoid creating an additional processor relationship.

Before using data in development or testing environments: Apply value substitution (pseudonymization) to all PII fields. Use realistic but non-identifiable values. Never use raw production data in non-production environments — a principle recommended by both the EDPB and NIST.

Before sharing data for cross-dataset matching: Apply salted hashing to the matching key field (email, account number). Both parties hash their data with the same salt. Comparison is possible without either party exposing raw values. Remember that the matched records remain pseudonymized — the analysis output may still identify individuals if it links back to personal attributes.

For payment card data (PCI DSS scope reduction): Apply tokenization to cardholder data. This is the PCI DSS-recommended approach for reducing scope — systems holding only tokens do not need full PCI DSS compliance if the token vault is properly isolated.
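
The cross-dataset matching case above can be sketched as follows: each party normalizes and hashes its own email list with the shared salt, and only the digests are compared. This is a minimal illustration using Node.js's createHash and the same value-plus-salt construction as the hashEmail function earlier in this guide; the function names and sample addresses are hypothetical.

```javascript
const { createHash } = require('crypto');

// Normalize, then hash with the salt both parties agreed on out of band.
function matchKey(email, sharedSalt) {
  const normalized = email.trim().toLowerCase();
  return createHash('sha256').update(normalized + sharedSalt, 'utf8').digest('hex');
}

// Each party runs this on its own list and shares only the digests.
function toMatchKeys(emails, sharedSalt) {
  return new Set(emails.map(email => matchKey(email, sharedSalt)));
}

// Count digests present in both sets.
function countOverlap(keysA, keysB) {
  let overlap = 0;
  for (const key of keysA) {
    if (keysB.has(key)) overlap += 1;
  }
  return overlap;
}

const salt = 'agreed-secret-salt';
const partyA = toMatchKeys(['Ana@Example.com', 'bo@example.com'], salt);
const partyB = toMatchKeys([' ana@example.com ', 'cy@example.com'], salt);
console.log(countOverlap(partyA, partyB)); // 1
```

Normalizing before hashing matters: without it, "Ana@Example.com" and "ana@example.com" would produce unrelated digests and the match would be missed.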

Many CSV processing tools upload your file to remote servers to apply these techniques, and many SaaS tools retain uploaded files temporarily for debugging, caching, or processing; retention policies vary by vendor. For files containing PII, the upload creates an Article 28 processor relationship before any masking has been applied. SplitForge applies all five techniques locally in your browser: the raw file contents are never transmitted to a server during masking operations.



FAQ

Does masking PII make a CSV file GDPR-compliant to share?

Masking PII reduces re-identification risk and is a good security practice, but it does not make data GDPR-compliant to share without other safeguards. Masked data is pseudonymized and therefore still personal data under GDPR Article 4(5). Before sharing with an external party, you still need a legal basis for the processing, a signed Data Processing Agreement with the recipient, and compliance with any applicable transfer restrictions. Masking reduces the harm if something goes wrong; it does not remove the legal obligations that apply before you share.

What is the difference between masking and pseudonymization?

Masking replaces part of a value with neutral characters (asterisks, Xs) while preserving format. Pseudonymization replaces the entire value with a consistent fictional alternative. Both techniques produce pseudonymized data under GDPR. The key practical difference is consistency: pseudonymization produces the same replacement for the same input across all records, preserving analytical relationships. Masking typically does not preserve this consistency.

Which masking technique should I use for PHI under HIPAA?

For PHI in CSV files, the most important HIPAA consideration is not which masking technique you use but whether the masking is applied before the file reaches any external server. An unmasked file uploaded to a cloud masking tool without a BAA creates HIPAA exposure at the point of upload, before the masking provides any protection. Apply masking locally, then transmit or store the masked result. For data that needs to be fully de-identified under HIPAA's Safe Harbor standard, all 18 HIPAA identifiers must be removed or generalized, not merely masked.

Does hashing email addresses anonymize them?

Generally no. Email addresses for common domains (gmail.com, outlook.com, company domains) have relatively low entropy. Precomputed tables of likely addresses can reverse unsalted SHA-256 hashes at scale. Opinion 05/2014 on Anonymisation Techniques, issued by the Article 29 Working Party (the EDPB's predecessor), treats hashing as pseudonymisation and notes that hashes of guessable inputs can be reversed. Salted hashing significantly improves security but does not automatically meet the GDPR Recital 26 anonymization standard. Treat hashed email addresses as pseudonymized personal data and maintain full GDPR compliance.

Is tokenization the same as encryption?

No. Tokenization replaces values with randomly generated tokens that have no mathematical relationship to the originals. The mapping lives in a token vault. Encryption transforms values using a mathematical algorithm that can be reversed with the correct key. Both produce pseudonymized data under GDPR. The practical distinction: tokenization is typically used to reduce scope (e.g., PCI DSS cardholder data) by replacing sensitive values with non-sensitive tokens in downstream systems. Encryption is used to protect data in transit or at rest. Both require key or vault management; compromise of the key or vault restores the original values.

Do I still need a Data Processing Agreement to share masked CSV files?

Yes. Masked CSV files are pseudonymized data and therefore still personal data under GDPR Article 4(5). If you share them with a third party that processes them on your behalf, that party is a processor under GDPR Article 4(8), and an Article 28 DPA is required before sharing. The masking reduces the risk if the file is misused or breached, but it does not change the legal classification of the data or the contractual obligations that apply.



Legal disclaimer: The content in this post is for informational purposes only and does not constitute legal advice. GDPR compliance for any specific data processing activity depends on your data types, purposes, and jurisdiction. Consult qualified legal counsel before relying on any masking technique for compliance purposes.
