Verified Benchmark — February 2026

Pattern Extraction Performance:
5M Rows in 45 Seconds (2026 Benchmark)

~111K rows/second (balanced mode, dataset-dependent). Emails, phones, credit cards, SSNs, URLs, dates, IP addresses, and ZIP codes — all extracted and validated in your browser with zero server transmission. Works alongside Data Masking for full PII workflows, or see the full tool overview.

  • Peak throughput: ~111K rows/sec (balanced mode)
  • Maximum tested: 5M+ rows (~640MB)
  • File uploads: never (zero transmission)
  • Pattern types: 8 (email, phone, CC, SSN...)

Benchmark Performance

All times: Chrome 131 (stable), Windows 11, Intel i7-12700K, 32GB RAM, February 2026. 10 runs per configuration — highest/lowest discarded, remaining 8 averaged. Results vary by hardware, browser, and file density. 10M row projection not yet benchmarked — extrapolated from 5M results.

Performance at Scale

Chrome 131 (stable) · Windows 11 · Intel i7-12700K · 32GB RAM · February 2026

Configuration | Time | Throughput | Test Notes
100K rows · 4 types · balanced | ~0.8 sec | ~125K rows/sec | Startup overhead visible at small sizes
1M rows · 4 types · balanced | ~8.5 sec | ~118K rows/sec | Comma-separated values, mixed data types
5M rows · 6 types · balanced | ~45 sec | ~111K rows/sec | Peak benchmark — 640MB CSV, 551K patterns found
5M rows · 8 types · strict | ~58 sec | ~86K rows/sec | All types + Luhn + SSA + area code validation
5M rows · 2 types · permissive | ~28 sec | ~179K rows/sec | Email + phone only, no deep validation
~1GB file · 4 types · balanced | ~75 sec | ~95K rows/sec | Maximum tested capacity, browser-dependent
2.1M rows · 4 types · balanced | ~18.4 sec | ~114K rows/sec | Real-world CRM export — mixed free-text notes fields, anonymized dataset, Feb 2026

Results vary by hardware, browser, number of pattern types active, validation mode, and file data density. Throughput increases with permissive mode and fewer pattern types; decreases with strict mode and all 8 types.

SplitForge vs. Python: Which Tool Fits Your Workflow?

No single tool is right for every situation. Here's an honest breakdown.

Use SplitForge when:
  • Data contains PII, PHI, or regulated financial information that cannot leave your device
  • You need Luhn, SSA, or NANP validation without writing or maintaining custom code
  • You want normalization (E.164, ISO 8601) in the same step as extraction
  • You need context view for audit trails or compliance review documentation
  • File is under ~1GB and the task is a one-off or periodic workflow
  • You don't write Python — or don't want to for this specific task
  • Speed of ~111K rows/sec is sufficient for your dataset size
Use Python re / pandas when:
  • You need to run extraction on a schedule or inside an automated data pipeline
  • You need custom regex patterns beyond the 8 built-in types
  • Files exceed 1GB or have 50M+ rows where browser memory limits apply
  • Extraction is part of a larger ETL or transformation workflow
  • You're comfortable writing and maintaining Python code
  • You need output piped directly into another process or database
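For readers weighing the Python route, the core workflow can be sketched in a few lines. This is a minimal, hypothetical example: the two regexes below are illustrative placeholders, not the production-grade patterns SplitForge ships, and real-world email and phone matching needs considerably more edge-case handling.

```python
import re
import pandas as pd

# Illustrative patterns for two of the eight types (placeholders only).
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone": re.compile(r"\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}"),
}

def extract(df: pd.DataFrame) -> pd.DataFrame:
    """Scan every cell and return (row, column, type, value) matches."""
    hits = []
    for col in df.columns:
        for idx, cell in df[col].astype(str).items():
            for ptype, rx in PATTERNS.items():
                for match in rx.findall(cell):
                    hits.append((idx, col, ptype, match))
    return pd.DataFrame(hits, columns=["row", "column", "type", "value"])

df = pd.DataFrame({"notes": ["call 555-867-5309", "mail bob@example.com"]})
print(extract(df))
```

This covers regex matching only (roughly SplitForge's permissive mode); validation, normalization, and deduplication are each additional code you would write and maintain.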

Feature Performance Overhead

Permissive Mode (Baseline)
Baseline
~179K rows/sec
Regex matching only — no validation logic beyond the initial pattern match. Fastest throughput, highest false positive rate. Use for internal data where you plan to review all results manually, or when recall matters more than precision.
Balanced Mode (Default)
+61% time
~111K rows/sec
Applies format validation rules on top of regex matching: US area code lookup for phone numbers, date range validation (1900–2100), protocol and TLD checks for URLs. Eliminates the majority of false positives with moderate overhead. Recommended for most workflows.
Strict Mode (Deep Validation)
+108% time
~86K rows/sec
Full validation stack: Luhn algorithm for every CC candidate, SSA area/group/serial rules for SSNs, IANA TLD lookup for URLs, area code validation against full NANP allocation table. Best precision — use when every flagged match will trigger an action (notification, CRM update, manual review).
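The Luhn step in the strict-mode stack is a standard public-domain checksum, so it can be sketched directly (this is the textbook algorithm, not SplitForge's internal implementation):

```python
def luhn_valid(number: str) -> bool:
    """Luhn checksum: double every second digit from the right,
    subtract 9 from any doubled digit over 9, require sum % 10 == 0."""
    digits = [int(c) for c in number if c.isdigit()]
    if len(digits) < 13:  # shorter than any real card number
        return False
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

# 4111 1111 1111 1111 is the classic Visa test number
print(luhn_valid("4111 1111 1111 1111"))  # True
```

Running this check on every credit-card candidate is part of why strict mode trades throughput for precision.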
Normalization (Phone + Email + Date)
+5% time
~105K rows/sec
Post-extraction format standardization: phone to E.164 (+15551234567), email to lowercase, date to ISO 8601. Operates on already-extracted match values with O(n) string operations. Negligible impact — always worth enabling before loading results into a database or CRM.
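The three normalizations are cheap string transforms, which is why the overhead is small. A naive sketch under simplifying assumptions (US-only phone numbers, a single known date input format; real E.164 handling needs country-aware rules):

```python
import re
from datetime import datetime

def normalize_phone(raw: str, country_code: str = "1") -> str:
    """Naive E.164: strip formatting, prepend +<country code> to 10-digit numbers."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) == 10:
        digits = country_code + digits
    return "+" + digits

def normalize_email(raw: str) -> str:
    return raw.strip().lower()

def normalize_date(raw: str, fmt: str = "%m/%d/%Y") -> str:
    """Parse a known input format and emit ISO 8601 (YYYY-MM-DD)."""
    return datetime.strptime(raw, fmt).date().isoformat()

print(normalize_phone("(555) 123-4567"))      # +15551234567
print(normalize_email("  Bob@Example.COM "))  # bob@example.com
print(normalize_date("02/14/2026"))           # 2026-02-14
```

Each transform is O(1) per match, so total cost scales linearly with the number of extracted values, not the file size.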
Deduplication (by value + type)
+10–12% time
~100K rows/sec
Set-based comparison tracking unique (value, type) pairs. Each match is checked against the seen-set — first occurrence kept, duplicates removed. Adds roughly 4–6 seconds on a 5M row file. Enable when you want a distinct list of values; disable when occurrence frequency matters.
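The seen-set approach described above is straightforward to illustrate (a sketch of the technique, not SplitForge's actual code):

```python
def dedupe(matches):
    """Keep the first occurrence of each unique (value, type) pair."""
    seen = set()
    out = []
    for m in matches:  # each match is a dict with "value" and "type" keys
        key = (m["value"], m["type"])
        if key not in seen:
            seen.add(key)
            out.append(m)
    return out

matches = [
    {"value": "bob@example.com", "type": "email"},
    {"value": "bob@example.com", "type": "email"},  # duplicate, dropped
    {"value": "555-123-4567", "type": "phone"},
]
print(len(dedupe(matches)))  # 2
```

Because the key includes the type, the same string flagged as two different pattern types is kept twice, which matters when a value is ambiguous.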
All 8 Types + Strict + Dedupe
+129% time
~78K rows/sec
Maximum-cost configuration: all 8 pattern types active, strict validation on all, deduplication enabled. Best precision and most complete results, highest processing cost. Use for full PII audits, HIPAA de-identification scans, or PCI-DSS compliance sweeps where thoroughness is the priority over speed.
All overhead figures measured on 5M row dataset, February 2026, Chrome 131 (stable), 32GB RAM, Intel i7-12700K. Results vary by hardware, browser, data density, and which pattern types are active.

Input Format Performance

CSV (Streaming)
Recommended
~111K rows/sec
PapaParse chunk-based streaming. File is never fully loaded into memory — processes 60MB at a time. Fastest option for large files. Supports auto-detection of comma, tab, semicolon, pipe delimiters with 90%+ confidence scoring.
Excel / XLSX
Handles multi-sheet
~75K rows/sec
SheetJS reads the full file with memory-efficient flags (cellFormula: false, cellHTML: false, cellText: false) before streaming to the worker. Slower startup than CSV due to binary parsing, but handles multi-sheet workbooks with a sheet selector UI. Convert to CSV first for maximum speed on large Excel files.
Text Paste
Up to 10MB
~2–4 sec flat
Direct text or structured content pasted into the tool. No file reading overhead — processes immediately. Best for small ad-hoc extractions from clipboard content, emails, documents, or API responses. Memory-bound above ~50KB; upload a file instead for larger inputs.

Calculate Your Time Savings

Manual baseline: ~20 minutes per extraction task via Python scripting — based on internal workflow testing, February 2026. Covers writing regex per pattern type, handling encoding edge cases, testing against sample data, debugging false positives, running on the full file, and validating output counts. SplitForge processes any combination of the 8 pattern types in a single pass with validated results and normalization options in ~45 seconds (balanced mode, dataset-dependent).

Typical: 2–4 types per task

Monthly = 12, Weekly = 52, Daily = 260

Analyst avg: $45–75/hr

Annual Time Saved
51.4
hours per year
Annual Labor Savings
$2,568
per year (vs. Python scripting per task)
What you eliminate:
  • Writing and maintaining regex for each pattern type
  • Debugging false positives from regex-only matching
  • Writing Luhn algorithm and SSA validation logic from scratch
  • A separate normalization pass after extraction
  • The compliance risk of using online tools for PII data
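The savings arithmetic behind the calculator is simple to reproduce. The sketch below uses assumed inputs (the ~20 minute manual baseline and ~45 second tool run stated above, a $50/hr rate in the middle of the $45–75 analyst range, and a weekly cadence); the calculator's exact settings for the figures shown are not reproduced here.

```python
def annual_savings(tasks_per_year, manual_min=20.0, tool_min=0.75, rate_per_hr=50.0):
    """Hours and dollars saved per year by replacing a ~20 min manual
    workflow with a ~45 s (0.75 min) tool run. Inputs are illustrative."""
    hours = tasks_per_year * (manual_min - tool_min) / 60
    return round(hours, 1), round(hours * rate_per_hr, 2)

# Weekly cadence (52 tasks/year) at the assumed $50/hr rate
print(annual_savings(52))
```

Swapping in a daily cadence (260) or a higher rate scales the result linearly.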

Testing Methodology

10 runs per config · drop high/low · report avg · test datasets available on request · re-tested quarterly


Honest Limitations: Where SplitForge Pattern Extraction Falls Short

No tool is perfect for every use case. Here's where Server-Side Entity Extraction (Python re / AWS Comprehend / spaCy) might be a better choice, and the real limitations of our browser-based architecture.

Browser-Based Processing

Performance depends on your device's RAM and CPU. Modern laptops (2022+) handled the full 5M-row benchmark comfortably, but older devices may struggle with very large files.

Workaround:
Close unnecessary browser tabs to free up memory. For files over 50M rows, consider database solutions.

No Offline Mode (Initial Load)

Requires internet connection to load the tool initially. Processing happens offline in your browser after loading.

Workaround:
Once loaded, you can disconnect and continue processing. For true offline environments, desktop tools may be better.

Browser Tab Memory Limits

Most browsers limit individual tabs to 2–4GB RAM. This is the practical ceiling for file size.

Workaround:
Use 64-bit browsers with sufficient RAM. Chrome and Firefox handle large files best.

Browser Memory Ceiling (~1GB / 5–10M Rows)

Maximum practical file size is ~1GB (~5–10M rows, browser-dependent). Files larger than this risk hitting browser memory limits, especially with all 8 pattern types active.

Workaround:
Split large files first using SplitForge CSV Splitter, then scan each chunk. For 50M+ row files, use Python re + pandas with chunking, or AWS Comprehend for entity extraction at pipeline scale.

No Custom Regex Patterns

Pattern Extraction supports only the 8 built-in types. User-defined regex patterns for custom schemas (account IDs, internal product codes, proprietary identifiers) are not currently supported.

Workaround:
For custom patterns, use Python re.findall() with your regex, or grep -oP for shell-based extraction. Custom regex support is a planned roadmap item.
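For a sense of what the Python fallback looks like, here is a hedged sketch with a hypothetical internal product-code scheme ("PRD-" plus six digits — an invented example, exactly the kind of custom pattern the built-in types don't cover):

```python
import re

# Hypothetical internal product-code scheme: "PRD-" followed by 6 digits.
PRODUCT_RX = re.compile(r"\bPRD-\d{6}\b")

text = "Shipped PRD-004211 and PRD-918733; PRD-12 is malformed."
print(PRODUCT_RX.findall(text))  # ['PRD-004211', 'PRD-918733']
```

The word boundaries (`\b`) keep partial codes like "PRD-12" from matching, the same precision concern the built-in validators address for standard types.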

No Automation or API

Pattern Extraction is a browser tool — no REST API, CLI, or pipeline integration. Cannot be embedded in ETL workflows, scheduled jobs, or triggered programmatically.

Workaround:
For scheduled or automated extraction, use Python re + pandas in a cron job, or AWS Comprehend for entity recognition at pipeline scale with full orchestration.

1M Pattern Result Cap

Results are capped at 1 million patterns per session. On very dense files where nearly every row contains multiple patterns, processing stops before the full scan completes.

Workaround:
Enable deduplication (unique (value, type) pairs count toward the cap, not every occurrence), or narrow column selection to target only columns most likely to contain the patterns you need.

Single File Per Session

Pattern Extraction scans one file at a time. No batch processing across multiple files in a single operation.

Workaround:
Process files sequentially. For high-volume batch scanning (50+ files), use Python re in a loop or a shell script with grep -oP across all files.
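A sequential batch loop in Python is only a few lines. A minimal sketch (the email regex is an illustrative placeholder, and `errors="replace"` papers over encoding issues a production script should handle properly):

```python
import glob
import re

EMAIL_RX = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def scan_files(glob_pattern: str):
    """Scan every matching file sequentially; one pass per file."""
    results = {}
    for path in sorted(glob.glob(glob_pattern)):
        with open(path, encoding="utf-8", errors="replace") as fh:
            results[path] = EMAIL_RX.findall(fh.read())
    return results

# e.g. scan_files("exports/*.csv")
```

Note this reads each file whole; for very large individual files, combine it with the chunked-reading approach.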

When to Use Server-Side Entity Extraction (Python re / AWS Comprehend / spaCy) Instead

You need pattern extraction in an automated pipeline or scheduled job

SplitForge has no API, and the browser-only workflow cannot run on a schedule or be triggered programmatically.

💡 Python re + pandas in a cron job, or AWS Comprehend for entity recognition in a Lambda function.

You need custom regex patterns for proprietary data schemas

SplitForge supports only the 8 built-in types. Custom schemas require user-defined regex.

💡 Python re.findall() with your pattern, or grep -oP for shell-based extraction.

You need to scan 50M+ row files

Browser memory limits the practical ceiling to ~5–10M rows. Server-side tools scale horizontally.

💡 Python re + pandas with chunking (chunksize parameter), or AWS Comprehend for large-scale entity extraction.
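The `chunksize` approach keeps memory flat regardless of file size. A minimal sketch (column name and email regex are illustrative placeholders):

```python
import re
import pandas as pd

EMAIL_RX = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def scan_large_csv(path: str, column: str, chunk_rows: int = 1_000_000):
    """Stream a huge CSV in fixed-size chunks so the whole file
    never sits in memory at once."""
    found = []
    for chunk in pd.read_csv(path, usecols=[column], chunksize=chunk_rows):
        for cell in chunk[column].dropna().astype(str):
            found.extend(EMAIL_RX.findall(cell))
    return found

# e.g. scan_large_csv("export_50m.csv", "notes")
```

Restricting the read to one column via `usecols` further cuts memory and parse time when patterns live in known fields.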

You need named entity recognition beyond structured patterns

SplitForge extracts structured patterns (email, phone, etc.). Extracting person names, organization names, or location references requires NER models.

💡 spaCy or AWS Comprehend for NER-based extraction. Combine with SplitForge for structured pattern extraction as a pre-processing step.

Questions about limitations? Check our FAQ section below or contact us via the feedback button.

Frequently Asked Questions

How accurate is the 111K rows/second benchmark?

Why does throughput vary so much by validation mode?

What is the difference between CSV streaming mode and text paste mode?

How does validation mode affect result quality vs. speed?

What is the deduplication overhead?

Can I reproduce these benchmarks?

What file formats are supported and does format affect speed?

What happens when the 1M pattern result cap is reached?

Benchmarks last updated: February 2026. Re-tested quarterly and after major algorithm changes. Last algorithm update: February 2026 — improved IANA TLD validation set and strict mode performance by ~8%.

Ready to Extract 551K Patterns in 45 Seconds?

No installation. File contents never uploaded. 8 validated pattern types, post-extraction normalization, and a results preview — all in your browser.