Quick Answer
What are the main PII masking techniques for CSV files?
Five techniques cover virtually every use case: redaction (remove the value entirely), masking (replace characters with asterisks, preserving structure), pseudonymization (replace real values with consistent fake values), hashing (replace values with one-way cryptographic digests), and tokenization (replace values with reversible tokens linked to a key table). Each technique produces a different result for re-identification risk, data utility, and GDPR status.
Critical distinction: All five of these techniques typically produce pseudonymized data — still regulated under GDPR. Only genuine anonymization, which meets the Recital 26 standard, removes GDPR obligations entirely. See our pseudonymization vs. anonymization guide for the legal details.
TL;DR: PII masking in CSV files doesn't require enterprise software. Redaction, masking, pseudonymization, hashing, and tokenization each serve different purposes and produce different levels of re-identification protection. The right technique depends on whether you need to re-link data later and who will receive the processed file. SplitForge Data Masking applies all five techniques locally in your browser — the file never leaves your device.
Your marketing team needs to share customer data with an external analytics firm. Your HR department needs to send payroll records to a benefits administrator. Your development team needs realistic test data without using production records.
In every case, the same problem: how do you reduce the personal data in a CSV to the minimum necessary for the task, without uploading your files to yet another cloud service and without spending a week configuring enterprise software?
Enterprise PII masking tools — IRI FieldShield, SAS Studio Data Masking, Strac — are powerful. They are also designed for organizations with dedicated data engineering teams, IT procurement budgets, and weeks to implement. They require either a local installation or a cloud upload of the exact data you are trying to protect.
There is a simpler path for the CSV use case. Five techniques, applied in minutes, cover the practical range of PII masking needs. Each technique was tested in March 2026 against synthetic CSV files containing names, email addresses, dates of birth, national ID numbers, and financial account data, using the SplitForge Data Masking tool.
Table of Contents
- The PII Masking Decision Tree
- Five Techniques at a Glance
- Technique 1: Redaction
- Technique 2: Character Masking
- Technique 3: Pseudonymization (Value Substitution)
- Technique 4: Hashing
- Technique 5: Tokenization
- GDPR Status of Each Technique
- Choosing the Right Technique
- Additional Resources
- FAQ
This guide is for: Data analysts, compliance teams, marketing operations, and HR teams who need to mask PII in CSV files before sharing, testing, or external processing — without enterprise tooling.
The PII Masking Decision Tree
Use this tree to identify the right technique before you start. The key questions are: do you need the data to be usable after masking, and do you need to re-link it to the original later?
Do you need to re-link masked data back to individuals later?
│
├─ YES (fraud investigation, SAR response, audit trail)
│   │
│   ├─ Must the masked version look realistic?
│   │   ├─ YES → Pseudonymization (consistent fake values)
│   │   └─ NO  → Tokenization (reversible token + key table)
│   │
│   └─ Key table can be kept securely? → Both above require key management
│
└─ NO (one-way operation, test data, external sharing)
    │
    ├─ Recipient needs to read/parse the field values?
    │   ├─ YES → Character Masking (preserves format, partial values visible)
    │   └─ NO  → Redaction (remove field entirely)
    │
    └─ Recipient needs to match/deduplicate across datasets?
        └─ YES → Hashing (consistent digest, no original value visible)
                 ⚠️ Only use for high-entropy values (not names, common emails)
Five Techniques at a Glance
Original CSV Data
│
├─ Redaction ──────────► Field removed entirely
│      Value: [REDACTED] or null
│      Re-linkable: No
│      Use: Recipient has no need for this field
│
├─ Character Masking ──► Partial value visible
│      Value: j***.***@c******.com
│      Re-linkable: No
│      Use: Format visible, value obscured
│
├─ Pseudonymization ───► Consistent fake value
│      Value: [email protected]
│      Re-linkable: Yes (key table)
│      Use: Analytics, ML training, test data
│
├─ Hashing ────────────► One-way digest
│      Value: 3b4c5d8e... (64-char hex)
│      Re-linkable: No (unless brute-forced)
│      Use: Cross-dataset matching, deduplication
│
└─ Tokenization ───────► Vault-based token
       Value: TKN-8829-4471-9920
       Re-linkable: Yes (vault access)
       Use: PCI DSS scope reduction, authorized retrieval
All five produce PSEUDONYMIZED data under GDPR Article 4(5).
GDPR obligations apply in full to all outputs above.
True anonymization (GDPR Recital 26) requires a separate assessment.
| Technique | Re-linkable? | Format preserved? | GDPR status | Best for |
|---|---|---|---|---|
| Redaction | No | No | Pseudonymized | Fields not needed by recipient |
| Character Masking | No | Yes (partial) | Pseudonymized | Human-readable partial values |
| Pseudonymization | Yes (key table) | Yes (realistic) | Pseudonymized | Test data, consistent replacements |
| Hashing | No (unless brute-forced) | No (digest) | Pseudonymized | Cross-dataset matching, deduplication |
| Tokenization | Yes (token vault) | No (token) | Pseudonymized | PCI DSS cardholder data, reversible |
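The decision tree in this section can be sketched as a small function. This is an illustrative reading of the tree, not part of any tool: the function name and option names are hypothetical, and checking the matching requirement before the readability requirement is an assumption, since the tree presents those two questions side by side.

```javascript
// Hypothetical helper that mirrors the decision tree.
// All option names are illustrative assumptions.
function chooseTechnique({
  needsRelink,
  mustLookRealistic = false,
  needsMatching = false,
  recipientReadsValues = false,
}) {
  if (needsRelink) {
    // Both reversible options depend on a securely stored key table or vault.
    return mustLookRealistic ? 'pseudonymization' : 'tokenization';
  }
  if (needsMatching) {
    // Only appropriate for high-entropy values (not names or common emails).
    return 'hashing';
  }
  return recipientReadsValues ? 'character masking' : 'redaction';
}

console.log(chooseTechnique({ needsRelink: true, mustLookRealistic: true }));
console.log(chooseTechnique({ needsRelink: false, recipientReadsValues: true }));
```

The function returns a starting point only; the GDPR status of the output is the same whichever branch is taken.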
Technique 1: Redaction
Redaction removes the field value entirely, replacing it with a null value, empty string, or placeholder such as [REDACTED].
When to use it: When the recipient has no legitimate need for the specific field. If you are sharing customer purchase behavior with an analytics firm, customer names and email addresses are almost certainly unnecessary for the analysis. Removing them entirely eliminates the exposure rather than managing it.
What it looks like:
| Before | After |
|---|---|
| Name: Maria Santos | Name: [REDACTED] |
| Email: [email protected] | Email: |
| National ID: 123-45-6789 | National ID: [REDACTED] |
GDPR status: The redacted record may still be pseudonymized data if the remaining fields allow re-identification. Removing one identifier from a 20-column dataset does not automatically anonymize the record. Apply data minimization (GDPR Article 5(1)(c)) — strip all columns the recipient does not need, not just the obvious PII columns.
What it does not do: Redaction does not prevent re-identification through quasi-identifiers. A record with name and email redacted but containing date of birth, postcode, employer, and device type may remain individually identifiable. Assess the remaining fields for re-identification risk before treating the output as safe to share.
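As a minimal sketch of redaction, assume the CSV has already been parsed into an array of plain objects, one per row. The function names `redactColumns` and `dropColumns` are hypothetical, and the sample email is an illustrative placeholder.

```javascript
// Replace the values of selected columns with a placeholder.
function redactColumns(rows, columns, placeholder = '[REDACTED]') {
  const targets = new Set(columns);
  return rows.map((row) => {
    const out = {};
    for (const [key, value] of Object.entries(row)) {
      out[key] = targets.has(key) ? placeholder : value;
    }
    return out;
  });
}

// Remove columns entirely (data minimization): the recipient never sees the header.
function dropColumns(rows, columns) {
  const targets = new Set(columns);
  return rows.map((row) =>
    Object.fromEntries(Object.entries(row).filter(([key]) => !targets.has(key))));
}

const sampleRows = [
  { name: 'Maria Santos', email: 'maria.santos@example.com', revenue_band: 'Mid' },
];
console.log(redactColumns(sampleRows, ['name', 'email']));
console.log(dropColumns(sampleRows, ['name', 'email']));
```

Both functions return new row objects and leave the input untouched. Neither addresses quasi-identifiers in the columns that remain.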
Technique 2: Character Masking
Character masking replaces parts of a field value with neutral characters (typically asterisks or Xs) while preserving the surrounding structure. The goal is to render the value unreadable while keeping the field format recognizable.
When to use it: When the recipient needs to understand the data format without seeing the actual values. Customer support teams who need to confirm a customer's email domain without seeing the full address. Audit logs that need to show a transaction occurred without exposing the account number.
What it looks like:
| Before | After (format preserved) |
|---|---|
| Email: [email protected] | Email: j***.@c***.com |
| Phone: +44 7911 123456 | Phone: +44 **** ***456 |
| Credit card: 4929 1234 5678 9012 | Credit card: 4929 **** **** 9012 |
| Date of birth: 1985-03-12 | Date of birth: 1985-**-** |
GDPR status: Always pseudonymization. The original value exists in the source system. The masked value often reveals enough structure to narrow re-identification, particularly for low-population data. A masked email that shows the first initial and full domain may be sufficient to identify an individual within a small organization.
What it does not do: Character masking does not prevent re-identification by someone who knows the partial values. It is appropriate for reducing casual exposure — not for sharing with untrusted external parties.
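The masking patterns in the table can be approximated with two small helpers. This is a sketch under stated assumptions: `maskDigits` and `maskEmail` are hypothetical names, the number of characters kept visible is a policy choice, and the sample email is illustrative.

```javascript
// Mask digits while keeping separators, so the field format stays recognizable.
// keepStart / keepEnd are the number of leading / trailing digits left visible.
function maskDigits(value, keepStart, keepEnd, maskChar = '*') {
  const totalDigits = (value.match(/\d/g) || []).length;
  let seen = 0;
  return value.replace(/\d/g, (digit) => {
    seen += 1;
    const visible = seen <= keepStart || seen > totalDigits - keepEnd;
    return visible ? digit : maskChar;
  });
}

// Keep the first character of the local part and of the domain name, plus the TLD.
function maskEmail(email, maskChar = '*') {
  const at = email.lastIndexOf('@');
  if (at < 1) return email.replace(/[^@.]/g, maskChar); // malformed: mask everything
  const local = email.slice(0, at);
  const domain = email.slice(at + 1);
  const lastDot = domain.lastIndexOf('.');
  const name = lastDot > 0 ? domain.slice(0, lastDot) : domain;
  const tld = lastDot > 0 ? domain.slice(lastDot) : '';
  const maskTail = (s) => (s.length ? s[0] + s.slice(1).replace(/[^.]/g, maskChar) : s);
  return `${maskTail(local)}@${maskTail(name)}${tld}`;
}

console.log(maskDigits('4929 1234 5678 9012', 4, 4)); // 4929 **** **** 9012
console.log(maskDigits('+44 7911 123456', 2, 3));     // +44 **** ***456
console.log(maskDigits('1985-03-12', 4, 0));          // 1985-**-**
console.log(maskEmail('maria.santos@example.com'));   // m****.******@e******.com
```

Note that this version preserves the length of each masked segment, which itself leaks information; a fixed-width mask is a reasonable alternative when length matters.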
Technique 3: Pseudonymization (Value Substitution)
Pseudonymization, as a specific technique, replaces real field values with consistent fictional alternatives. The same real value always maps to the same pseudonym. Maria Santos always becomes User_4821. Account number 123456 always becomes ACCT_8820. This consistency is what distinguishes pseudonymization from random masking.
When to use it: When the recipient needs data that behaves like real data — for analytics that join on customer IDs, for ML training that requires consistent entity representations, or for development environments that need realistic but non-identifiable datasets.
What it looks like:
| Before | After (consistent substitution) |
|---|---|
| CustomerID: C-10045 | CustomerID: USR-8829 |
| Name: Maria Santos | Name: Alex Rivera |
| Email: [email protected] | Email: [email protected] |
| Phone: +44 7911 123456 | Phone: +44 7700 900142 |
The substitution is deterministic: the same input always produces the same output. The relationship between Customer C-10045 and their purchase history is preserved in the pseudonymized dataset — but the identity is replaced.
GDPR status: Always pseudonymized data (GDPR Article 4(5)). The key table linking real identities to pseudonyms must be stored securely and separately. The output CSV is personal data. A Data Processing Agreement is still required if you share it with an external party. All data subject rights apply.
What it does not do: Pseudonymization does not prevent re-identification if the key table is compromised or if the pseudonymized fields can be cross-referenced against external datasets to recover the original identity.
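Consistent substitution comes down to a key table that is consulted before a new pseudonym is issued. The sketch below is one hypothetical way to do that in memory; `createPseudonymizer`, the prefix, and the sequential numbering are all illustrative assumptions rather than a description of any particular tool.

```javascript
// Builds a pseudonymizer that owns its key table (real value -> pseudonym).
function createPseudonymizer(prefix, start = 1000) {
  const keyTable = new Map();
  let next = start;
  return {
    pseudonymize(value) {
      if (!keyTable.has(value)) {
        keyTable.set(value, `${prefix}${next}`);
        next += 1;
      }
      // The same real value always maps to the same pseudonym.
      return keyTable.get(value);
    },
    // The key table enables re-linking; store it separately from the output CSV.
    exportKeyTable() {
      return [...keyTable.entries()];
    },
  };
}

const customerIds = createPseudonymizer('USR-');
console.log(customerIds.pseudonymize('C-10045')); // USR-1000
console.log(customerIds.pseudonymize('C-20001')); // USR-1001
console.log(customerIds.pseudonymize('C-10045')); // USR-1000 again
```

Sequential pseudonyms reveal the order in which values were first seen. If that matters, shuffling the input or drawing pseudonyms at random avoids the leak at the cost of a collision check.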
Technique 4: Hashing
Hashing applies a one-way cryptographic function to the field value, producing a fixed-length digest. The same input always produces the same digest (deterministic), but you cannot recover the original value from the digest alone.
When to use it: When you need consistent identifiers for cross-dataset matching or deduplication without retaining the original values. Comparing two customer lists to find overlap without exposing customer emails to the party performing the comparison. Generating unique identifiers for analytics without storing the underlying PII.
What it looks like:
| Before | After (SHA-256) |
|---|---|
| Email: [email protected] | Email: 3b4c5d... (64-char hex string) |
| National ID: 123-45-6789 | National ID: a7f2e1... (64-char hex string) |
Basic implementation (JavaScript, for reference):
// Hashing an email address with a salt using SHA-256
// Salt should be a secret, random string stored securely, not hardcoded
async function hashEmail(email, salt) {
  const input = email.toLowerCase().trim() + salt;
  const encoded = new TextEncoder().encode(input);
  const hashBuffer = await crypto.subtle.digest('SHA-256', encoded);
  const hashArray = Array.from(new Uint8Array(hashBuffer));
  return hashArray.map(b => b.toString(16).padStart(2, '0')).join('');
}
// Usage:
// hashEmail('[email protected]', 'your-secret-salt-here')
// → '3b4c5d8e9f...' (consistent digest for same input)
// The same email always produces the same hash (deterministic)
// Different salts produce entirely different hashes (preventing rainbow tables)
This runs entirely in the browser, with no server required. The crypto.subtle API is available in all modern browsers without any library dependencies, although browsers expose it only in secure contexts (HTTPS or localhost). Note that even with salting, hashed email addresses remain pseudonymized data under GDPR: anyone who holds the salt can hash candidate addresses and compare the results, so guessable inputs stay re-identifiable.
GDPR status: Still pseudonymized data in most practical cases. For low-entropy inputs (names, common email formats, national ID number patterns), hashed values can be reversed using precomputed rainbow tables or brute force. The Article 29 Working Party's Opinion 05/2014 on Anonymisation Techniques (the Working Party was the predecessor body to the EDPB) notes that hashing does not achieve anonymization when the input values are guessable.
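To make the brute-force risk concrete, the following sketch shows how an unsalted SHA-256 digest of an email can be matched against a list of guesses. It assumes Node.js and its built-in crypto module (rather than the browser's crypto.subtle) so it can run synchronously; `sha256Hex`, `reverseByGuessing`, and the sample addresses are hypothetical.

```javascript
const { createHash } = require('crypto');

// Unsalted SHA-256 digest as a 64-character hex string.
const sha256Hex = (text) => createHash('sha256').update(text, 'utf8').digest('hex');

// Returns the candidate whose digest matches the target, or null if none does.
function reverseByGuessing(targetDigest, candidates) {
  for (const candidate of candidates) {
    if (sha256Hex(candidate.toLowerCase().trim()) === targetDigest) {
      return candidate;
    }
  }
  return null;
}

// An attacker who knows the naming convention only needs a short guess list.
const leakedDigest = sha256Hex('maria.santos@example.com');
const guesses = ['m.santos@example.com', 'msantos@example.com', 'maria.santos@example.com'];
console.log(reverseByGuessing(leakedDigest, guesses)); // maria.santos@example.com
```

The digest is never inverted mathematically. The attack works because the space of plausible inputs is small enough to enumerate.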
Salted hashing (adding a secret random value to the input before hashing) significantly improves security but requires careful key management. Even salted hashes do not meet the GDPR Recital 26 anonymization standard if the original values can be guessed.
Why this matters in practice (the AOL case): In 2006, AOL released roughly 20 million "anonymized" search queries with usernames replaced by numeric IDs. Reporters at The New York Times identified at least one user from her search patterns alone, without any cryptographic reversal. Hashed identifiers provide stronger protection than numeric IDs, but the underlying principle applies: when the original value is guessable from context or external data, the hash provides weaker protection than its cryptographic strength suggests.
What it does not do: Hashing is not anonymization. For high-entropy values (random UUIDs, cryptographic keys), hashing provides strong pseudonymization. For common personal data values, assume the hash is reversible and maintain full GDPR compliance.
Technique 5: Tokenization
Tokenization replaces field values with randomly generated tokens that have no mathematical relationship to the original values. A centralized token vault stores the mapping between tokens and original values. The token can be exchanged for the original only by parties with access to the vault.
When to use it: When you need to reduce PCI DSS cardholder data scope (replacing card numbers with tokens in systems that process but do not need the real numbers), or when you need a reversible de-identification method with strict access control over who can recover original values.
What it looks like:
| Before | After |
|---|---|
| Card: 4929 1234 5678 9012 | Card: TKN-8829-4471-9920 |
| Account: ACC-10045 | Account: TKN-2254-8819-0011 |
GDPR status: Always pseudonymized data. The token vault holds the mapping to original values. Whoever controls vault access can recover the originals. GDPR applies in full to both the tokenized data and the vault.
Key difference from hashing: Tokenization is explicitly reversible by design — that is its purpose. Hashing is intended to be one-way. Use tokenization when re-linkability is a feature (PCI DSS compliance, authorized retrieval workflows). Use hashing when you want one-way consistency without a key management dependency.
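A token vault can be sketched as two lookup tables and a random token generator. The example below is a minimal in-memory illustration assuming Node.js; the `TokenVault` class and the `TKN-####-####-####` format (borrowed from the table above) are hypothetical, and a real vault would add persistence, encryption at rest, and access control.

```javascript
const { randomInt } = require('crypto');

class TokenVault {
  constructor() {
    this.tokenByValue = new Map(); // original value -> token
    this.valueByToken = new Map(); // token -> original value
  }

  // Issue a random token with no mathematical relationship to the value.
  tokenize(value) {
    const existing = this.tokenByValue.get(value);
    if (existing) return existing;
    let token;
    do {
      const groups = [0, 1, 2].map(() => String(randomInt(0, 10000)).padStart(4, '0'));
      token = `TKN-${groups.join('-')}`;
    } while (this.valueByToken.has(token)); // retry on the rare collision
    this.tokenByValue.set(value, token);
    this.valueByToken.set(token, value);
    return token;
  }

  // Recovery is possible only through the vault.
  detokenize(token) {
    if (!this.valueByToken.has(token)) throw new Error('Unknown token');
    return this.valueByToken.get(token);
  }
}

const vault = new TokenVault();
const cardToken = vault.tokenize('4929 1234 5678 9012');
console.log(cardToken);                   // e.g. TKN-8829-4471-9920 (random each run)
console.log(vault.detokenize(cardToken)); // 4929 1234 5678 9012
```

Unlike a hash, the token cannot be recomputed from the input by anyone outside the vault, which is why guessable inputs are less of a concern here than vault access.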
GDPR Status of Each Technique
This is the single most important thing to understand about PII masking: every technique in this guide produces pseudonymized data, not anonymized data, in almost all practical scenarios. GDPR applies to pseudonymized data in full.
| Technique | Typically Pseudonymized? | Meets Recital 26 Anonymization? | GDPR Obligations |
|---|---|---|---|
| Redaction | Yes — remaining fields may allow re-identification | Only if ALL quasi-identifiers are also removed | Full GDPR obligations apply |
| Character masking | Yes — original exists in source; partial values remain | No | Full GDPR obligations apply |
| Value substitution | Yes — key table allows re-identification | No | Full GDPR obligations apply |
| Hashing (unsalted) | Yes — reversible for low-entropy inputs | No | Full GDPR obligations apply |
| Hashing (salted, high-entropy) | Reduced re-identification risk but typically still pseudonymized | Contextual — requires full quasi-identifier analysis | Full GDPR obligations apply unless Recital 26 standard confirmed |
| Tokenization | Yes — vault enables re-identification | No | Full GDPR obligations apply |
Read our pseudonymization vs. anonymization guide for the legal framework governing when data genuinely falls outside GDPR scope.
End-to-End Example: Selecting and Applying a Technique
This walkthrough covers the complete process from identifying a field to producing the masked output — the decision a data engineer or analyst makes in practice.
Scenario: You have a customer CSV with these columns and are preparing it for handoff to an external analytics firm:
customer_id, email, full_name, phone, dob, postcode, revenue_band, churn_risk_score
Step 1 — Classify each field:
| Field | Type | Contains Direct Identifier? | Recipient Needs It? |
|---|---|---|---|
| customer_id | Internal key | No — opaque ID | Yes — for record linkage |
| email | Direct identifier | Yes | No |
| full_name | Direct identifier | Yes | No |
| phone | Direct identifier | Yes | No |
| dob | Quasi-identifier | No — but narrows population | No |
| postcode | Quasi-identifier | No — but narrows geography | Sector only (first 4 chars) |
| revenue_band | Categorical | No | Yes |
| churn_risk_score | Analytical output | No | Yes |
Step 2 — Apply data minimization first: Drop email, full_name, phone entirely — the recipient has no stated need for these fields. Stripping them is more protective than masking.
Step 3 — Apply technique to remaining sensitive fields:
| Field | Decision | Technique | Output Example |
|---|---|---|---|
| customer_id | Replace with consistent pseudonym | Salted hash | f7e2a1b3... (consistent per customer) |
| dob | Not needed — strip | Redaction | [REMOVED] |
| postcode | Needed at sector level | Truncation (character masking) | SW1A 1AA → SW1A |
| revenue_band | Keep as-is | None | Mid |
| churn_risk_score | Keep as-is | None | 0.72 |
Step 4 — Output:
customer_hash, postcode_sector, revenue_band, churn_risk_score
f7e2a1b3..., SW1A, Mid, 0.72
a9c2f1d4..., EC1A, High, 0.41
What this achieves: No direct identifiers remain. The customer_hash allows the analytics firm to track the same customer across rows without knowing their identity — this is pseudonymization under GDPR Article 4(5), not anonymization. A DPA with the recipient is still required. The firm cannot re-identify customers from this output alone, but re-identification risk from quasi-identifier combinations (postcode_sector + revenue_band) should be assessed if the recipient has access to external datasets.
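The four steps above can be sketched as a single transformation over parsed rows. This assumes Node.js, rows already parsed into objects, and UK-style postcodes containing a space; `saltedHash` and `prepareForHandoff` are hypothetical names, and the sample values are illustrative.

```javascript
const { createHash } = require('crypto');

// Salted SHA-256 digest; the salt must be kept secret and out of the output file.
function saltedHash(value, salt) {
  return createHash('sha256').update(String(value) + salt, 'utf8').digest('hex');
}

// Keep only what the recipient needs: hashed ID, truncated postcode, analytics fields.
function prepareForHandoff(rows, salt) {
  return rows.map((row) => ({
    customer_hash: saltedHash(row.customer_id, salt),
    // Keep the part before the space, e.g. "SW1A 1AA" -> "SW1A".
    postcode_sector: String(row.postcode).trim().toUpperCase().split(/\s+/)[0],
    revenue_band: row.revenue_band,
    churn_risk_score: row.churn_risk_score,
  }));
}

const customers = [{
  customer_id: 'C-10045', email: 'maria.santos@example.com', full_name: 'Maria Santos',
  phone: '+44 7911 123456', dob: '1985-03-12', postcode: 'SW1A 1AA',
  revenue_band: 'Mid', churn_risk_score: 0.72,
}];
console.log(prepareForHandoff(customers, 'replace-with-a-secret-salt'));
```

Because the output object is built from an allow-list of fields, any column added to the source later is excluded by default rather than leaked by default.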
Choosing the Right Technique
Before sharing data with an external party: Apply pseudonymization (value substitution or redaction) to all direct identifiers. Strip all columns the recipient does not need. Assess remaining quasi-identifiers for re-identification risk. Sign a DPA with the recipient before transmission. Process locally if possible to avoid creating an additional processor relationship.
Before using data in development or testing environments: Apply value substitution (pseudonymization) to all PII fields. Use realistic but non-identifiable values. Never use raw production data in non-production environments — a principle recommended by both the EDPB and NIST.
Before sharing data for cross-dataset matching: Apply salted hashing to the matching key field (email, account number). Both parties hash their data with the same salt. Comparison is possible without either party exposing raw values. Remember that the matched records remain pseudonymized — the analysis output may still identify individuals if it links back to personal attributes.
For payment card data (PCI DSS scope reduction): Apply tokenization to cardholder data. Tokenization is a widely used approach for reducing PCI DSS scope: systems that hold only tokens can fall outside the full set of PCI DSS requirements if the token vault is properly isolated, subject to validation by your assessor.
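The cross-dataset matching workflow described above can be sketched as follows: each party hashes its own list with the agreed salt, and only the digests are compared. This assumes Node.js; `hashKey`, `findOverlap`, and the sample addresses are hypothetical.

```javascript
const { createHash } = require('crypto');

// Normalize, then hash with the salt both parties agreed on out of band.
function hashKey(email, sharedSalt) {
  return createHash('sha256')
    .update(email.toLowerCase().trim() + sharedSalt, 'utf8')
    .digest('hex');
}

// Return the digests that appear in both lists.
function findOverlap(hashesA, hashesB) {
  const seen = new Set(hashesA);
  return hashesB.filter((digest) => seen.has(digest));
}

const sharedSalt = 'agreed-out-of-band';
const partyA = ['maria.santos@example.com', 'alex.rivera@example.com'].map((e) => hashKey(e, sharedSalt));
const partyB = [' Maria.Santos@Example.com', 'new.customer@example.com'].map((e) => hashKey(e, sharedSalt));
console.log(findOverlap(partyA, partyB).length); // 1
```

Normalizing case and whitespace before hashing matters here, since a single differing character produces an unrelated digest. Both parties hold the salt, so each can still test guesses against the other's digests.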
Many CSV processing tools upload your file to remote servers to apply these techniques, and some retain uploaded files temporarily for debugging, caching, or processing purposes; retention policies vary by vendor. For files containing PII, the upload itself creates an Article 28 processor relationship before any masking has been applied. SplitForge applies all five techniques locally in your browser, so the raw file contents are never transmitted to a server during masking operations.
Additional Resources
GDPR and Data Protection Standards:
- Article 29 Working Party Opinion 05/2014 on Anonymisation Techniques (WP216) — Authoritative guidance on what constitutes effective anonymization vs. pseudonymization
- GDPR Article 4(5) — Definition of pseudonymisation — Legal definition confirming pseudonymized data remains personal data
- GDPR Recital 26 — Anonymous information — The standard under which data falls outside GDPR scope
Technical Standards:
- NIST SP 800-188: De-identification of Government Datasets — Technical framework for data de-identification techniques and risk assessment
- HHS De-identification Guidance for HIPAA — Safe Harbor and Expert Determination methods for PHI
Further Reading:
- SplitForge: Pseudonymization vs. Anonymization — GDPR Legal Distinction — Why masking is pseudonymization, not anonymization, and what that means for GDPR obligations