Verified Benchmark — February 2026

Pattern Extraction Performance:
5M Rows in 45 Seconds (2026 Benchmark)

~111K rows/second (balanced mode, dataset-dependent). Emails, phones, credit cards, SSNs, URLs, dates, IP addresses, and ZIP codes — all extracted and validated in your browser with zero server transmission. Works alongside Data Masking for full PII workflows, or see the full tool overview.

  • Peak throughput: ~111K rows/sec (balanced mode)
  • Maximum tested: 5M+ rows (~640MB)
  • File uploads: never (zero transmission)
  • Pattern types: 8 (email, phone, CC, SSN...)

Benchmark Performance

All times: Chrome 131 (stable), Windows 11, Intel i7-12700K, 32GB RAM, February 2026. 10 runs per configuration — highest/lowest discarded, remaining 8 averaged. Results vary by hardware, browser, and file density. 10M row projection not yet benchmarked — extrapolated from 5M results.

Performance at Scale

Chrome 131 (stable) · Windows 11 · Intel i7-12700K · 32GB RAM · February 2026

Configuration | Time | Throughput | Test Notes
100K rows · 4 types · balanced | ~0.8 sec | ~125K rows/sec | Startup overhead visible at small sizes
1M rows · 4 types · balanced | ~8.5 sec | ~118K rows/sec | Comma-separated values, mixed data types
5M rows · 6 types · balanced | ~45 sec | ~111K rows/sec | Peak benchmark — 640MB CSV, 551K patterns found
5M rows · 8 types · strict | ~58 sec | ~86K rows/sec | All types + Luhn + SSA + area code validation
5M rows · 2 types · permissive | ~28 sec | ~179K rows/sec | Email + phone only, no deep validation
~1GB file · 4 types · balanced | ~75 sec | ~95K rows/sec | Maximum tested capacity, browser-dependent
2.1M rows · 4 types · balanced | ~18.4 sec | ~114K rows/sec | Real-world CRM export — mixed free-text notes fields, anonymized dataset, Feb 2026

Results vary by hardware, browser, number of pattern types active, validation mode, and file data density. Throughput increases with permissive mode and fewer pattern types; decreases with strict mode and all 8 types.

SplitForge vs. Python: Which Tool Fits Your Workflow?

No single tool is right for every situation. Here's an honest breakdown.

Use SplitForge when:
  • Data contains PII, PHI, or regulated financial information that cannot leave your device
  • You need Luhn, SSA, or NANP validation without writing or maintaining custom code
  • You want normalization (E.164, ISO 8601) in the same step as extraction
  • You need context view for audit trails or compliance review documentation
  • File is under ~1GB and the task is a one-off or periodic workflow
  • You don't write Python — or don't want to for this specific task
  • Speed of ~111K rows/sec is sufficient for your dataset size
Use Python re / pandas when:
  • You need to run extraction on a schedule or inside an automated data pipeline
  • You need custom regex patterns beyond the 8 built-in types
  • Files exceed 1GB or have 50M+ rows where browser memory limits apply
  • Extraction is part of a larger ETL or transformation workflow
  • You're comfortable writing and maintaining Python code
  • You need output piped directly into another process or database
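For readers weighing the Python route, the core workflow can be sketched in a few lines. This is a minimal, hypothetical example: the two regexes below are illustrative placeholders, not the production-grade patterns SplitForge ships, and real-world email and phone matching needs considerably more edge-case handling.

```python
import re
import pandas as pd

# Illustrative patterns for two of the eight types (placeholders only).
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone": re.compile(r"\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}"),
}

def extract(df: pd.DataFrame) -> pd.DataFrame:
    """Scan every cell and return (row, column, type, value) matches."""
    hits = []
    for col in df.columns:
        for idx, cell in df[col].astype(str).items():
            for ptype, rx in PATTERNS.items():
                for match in rx.findall(cell):
                    hits.append((idx, col, ptype, match))
    return pd.DataFrame(hits, columns=["row", "column", "type", "value"])

df = pd.DataFrame({"notes": ["call 555-867-5309", "mail bob@example.com"]})
print(extract(df))
```

This covers regex matching only (roughly SplitForge's permissive mode); validation, normalization, and deduplication are each additional code you would write and maintain.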

Feature Performance Overhead

Permissive Mode (Baseline)
Baseline
~179K rows/sec
Regex matching only — no validation logic beyond the initial pattern match. Fastest throughput, highest false positive rate. Use for internal data where you plan to review all results manually, or when recall matters more than precision.
Balanced Mode (Default)
+61% time
~111K rows/sec
Applies format validation rules on top of regex matching: US area code lookup for phone numbers, date range validation (1900–2100), protocol and TLD checks for URLs. Eliminates the majority of false positives with moderate overhead. Recommended for most workflows.
Strict Mode (Deep Validation)
+108% time
~86K rows/sec
Full validation stack: Luhn algorithm for every CC candidate, SSA area/group/serial rules for SSNs, IANA TLD lookup for URLs, area code validation against full NANP allocation table. Best precision — use when every flagged match will trigger an action (notification, CRM update, manual review).
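The Luhn step in the strict-mode stack is a standard public-domain checksum, so it can be sketched directly (this is the textbook algorithm, not SplitForge's internal implementation):

```python
def luhn_valid(number: str) -> bool:
    """Luhn checksum: double every second digit from the right,
    subtract 9 from any doubled digit over 9, require sum % 10 == 0."""
    digits = [int(c) for c in number if c.isdigit()]
    if len(digits) < 13:  # shorter than any real card number
        return False
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

# 4111 1111 1111 1111 is the classic Visa test number
print(luhn_valid("4111 1111 1111 1111"))  # True
```

Running this check on every credit-card candidate is part of why strict mode trades throughput for precision.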
Normalization (Phone + Email + Date)
+5% time
~105K rows/sec
Post-extraction format standardization: phone to E.164 (+15551234567), email to lowercase, date to ISO 8601. Operates on already-extracted match values with O(n) string operations. Negligible impact — always worth enabling before loading results into a database or CRM.
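The three normalizations are cheap string transforms, which is why the overhead is small. A naive sketch under simplifying assumptions (US-only phone numbers, a single known date input format; real E.164 handling needs country-aware rules):

```python
import re
from datetime import datetime

def normalize_phone(raw: str, country_code: str = "1") -> str:
    """Naive E.164: strip formatting, prepend +<country code> to 10-digit numbers."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) == 10:
        digits = country_code + digits
    return "+" + digits

def normalize_email(raw: str) -> str:
    return raw.strip().lower()

def normalize_date(raw: str, fmt: str = "%m/%d/%Y") -> str:
    """Parse a known input format and emit ISO 8601 (YYYY-MM-DD)."""
    return datetime.strptime(raw, fmt).date().isoformat()

print(normalize_phone("(555) 123-4567"))      # +15551234567
print(normalize_email("  Bob@Example.COM "))  # bob@example.com
print(normalize_date("02/14/2026"))           # 2026-02-14
```

Each transform is O(1) per match, so total cost scales linearly with the number of extracted values, not the file size.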
Deduplication (by value + type)
+10–12% time
~100K rows/sec
Set-based comparison tracking unique (value, type) pairs. Each match is checked against the seen-set — first occurrence kept, duplicates removed. Adds roughly 4–6 seconds on a 5M row file. Enable when you want a distinct list of values; disable when occurrence frequency matters.
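The seen-set approach described above is straightforward to illustrate (a sketch of the technique, not SplitForge's actual code):

```python
def dedupe(matches):
    """Keep the first occurrence of each unique (value, type) pair."""
    seen = set()
    out = []
    for m in matches:  # each match is a dict with "value" and "type" keys
        key = (m["value"], m["type"])
        if key not in seen:
            seen.add(key)
            out.append(m)
    return out

matches = [
    {"value": "bob@example.com", "type": "email"},
    {"value": "bob@example.com", "type": "email"},  # duplicate, dropped
    {"value": "555-123-4567", "type": "phone"},
]
print(len(dedupe(matches)))  # 2
```

Because the key includes the type, the same string flagged as two different pattern types is kept twice, which matters when a value is ambiguous.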
All 8 Types + Strict + Dedupe
+129% time
~78K rows/sec
Maximum-cost configuration: all 8 pattern types active, strict validation on all, deduplication enabled. Best precision and most complete results, highest processing cost. Use for full PII audits, HIPAA de-identification scans, or PCI-DSS compliance sweeps where thoroughness is the priority over speed.
All overhead figures measured on 5M row dataset, February 2026, Chrome 131 (stable), 32GB RAM, Intel i7-12700K. Results vary by hardware, browser, data density, and which pattern types are active.

Input Format Performance

CSV (Streaming)
Recommended
~111K rows/sec
PapaParse chunk-based streaming. File is never fully loaded into memory — processes 60MB at a time. Fastest option for large files. Supports auto-detection of comma, tab, semicolon, pipe delimiters with 90%+ confidence scoring.
Excel / XLSX
Handles multi-sheet
~75K rows/sec
SheetJS reads the full file with memory-efficient flags (cellFormula: false, cellHTML: false, cellText: false) before streaming to the worker. Slower startup than CSV due to binary parsing, but handles multi-sheet workbooks with a sheet selector UI. Convert to CSV first for maximum speed on large Excel files.
Text Paste
Up to 10MB
~2–4 sec flat
Direct text or structured content pasted into the tool. No file reading overhead — processes immediately. Best for small ad-hoc extractions from clipboard content, emails, documents, or API responses. Memory-bound above ~50KB; upload a file instead for larger inputs.

Calculate Your Time Savings

Manual baseline: ~20 minutes per extraction task via Python scripting — based on internal workflow testing, February 2026. Covers writing regex per pattern type, handling encoding edge cases, testing against sample data, debugging false positives, running on the full file, and validating output counts. SplitForge processes any combination of the 8 pattern types in a single pass with validated results and normalization options in ~45 seconds (balanced mode, dataset-dependent).

Typical: 2–4 types per task

Monthly = 12, Weekly = 52, Daily = 260

Analyst avg: $45–75/hr

Annual Time Saved
51.4
hours per year
Annual Labor Savings
$2,568
per year (vs. Python scripting per task)
What you eliminate:
  • Writing and maintaining regex for each pattern type
  • Debugging false positives from regex-only matching
  • Writing Luhn algorithm and SSA validation logic from scratch
  • A separate normalization pass after extraction
  • The compliance risk of using online tools for PII data
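The savings arithmetic behind the calculator is simple to reproduce. The sketch below uses assumed inputs (the ~20 minute manual baseline and ~45 second tool run stated above, a $50/hr rate in the middle of the $45–75 analyst range, and a weekly cadence); the calculator's exact settings for the figures shown are not reproduced here.

```python
def annual_savings(tasks_per_year, manual_min=20.0, tool_min=0.75, rate_per_hr=50.0):
    """Hours and dollars saved per year by replacing a ~20 min manual
    workflow with a ~45 s (0.75 min) tool run. Inputs are illustrative."""
    hours = tasks_per_year * (manual_min - tool_min) / 60
    return round(hours, 1), round(hours * rate_per_hr, 2)

# Weekly cadence (52 tasks/year) at the assumed $50/hr rate
print(annual_savings(52))
```

Swapping in a daily cadence (260) or a higher rate scales the result linearly.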

Testing Methodology

10 runs per config · drop high/low · report avg · test datasets available on request · re-tested quarterly


Honest Limitations: Where SplitForge Pattern Extraction Falls Short

No tool is perfect for every use case. Here's where Server-Side Entity Extraction (Python re / AWS Comprehend / spaCy) might be a better choice, and the real limitations of our browser-based architecture.

Browser-Based Processing

Performance depends on your device's RAM and CPU. Modern laptops (2022+) handled the full 5M-row benchmark comfortably, but older devices may struggle with very large files.

Workaround:
Close unnecessary browser tabs to free up memory. For files over 50M rows, consider database solutions.

No Offline Mode (Initial Load)

Requires internet connection to load the tool initially. Processing happens offline in your browser after loading.

Workaround:
Once loaded, you can disconnect and continue processing. For true offline environments, desktop tools may be better.

Browser Tab Memory Limits

Most browsers limit individual tabs to 2–4GB RAM. This is the practical ceiling for file size.

Workaround:
Use 64-bit browsers with sufficient RAM. Chrome and Firefox handle large files best.

Browser Memory Ceiling (~1GB / 5–10M Rows)

Maximum practical file size is ~1GB (~5–10M rows, browser-dependent). Files larger than this risk hitting browser memory limits, especially with all 8 pattern types active.

Workaround:
Split large files first using SplitForge CSV Splitter, then scan each chunk. For 50M+ row files, use Python re + pandas with chunking, or AWS Comprehend for entity extraction at pipeline scale.

No Custom Regex Patterns

Pattern Extraction supports only the 8 built-in types. User-defined regex patterns for custom schemas (account IDs, internal product codes, proprietary identifiers) are not currently supported.

Workaround:
For custom patterns, use Python re.findall() with your regex, or grep -oP for shell-based extraction. Custom regex support is a planned roadmap item.
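For a sense of what the Python fallback looks like, here is a hedged sketch with a hypothetical internal product-code scheme ("PRD-" plus six digits — an invented example, exactly the kind of custom pattern the built-in types don't cover):

```python
import re

# Hypothetical internal product-code scheme: "PRD-" followed by 6 digits.
PRODUCT_RX = re.compile(r"\bPRD-\d{6}\b")

text = "Shipped PRD-004211 and PRD-918733; PRD-12 is malformed."
print(PRODUCT_RX.findall(text))  # ['PRD-004211', 'PRD-918733']
```

The word boundaries (`\b`) keep partial codes like "PRD-12" from matching, the same precision concern the built-in validators address for standard types.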

No Automation or API

Pattern Extraction is a browser tool — no REST API, CLI, or pipeline integration. Cannot be embedded in ETL workflows, scheduled jobs, or triggered programmatically.

Workaround:
For scheduled or automated extraction, use Python re + pandas in a cron job, or AWS Comprehend for entity recognition at pipeline scale with full orchestration.

1M Pattern Result Cap

Results are capped at 1 million patterns per session. On very dense files where nearly every row contains multiple patterns, processing stops before the full scan completes.

Workaround:
Enable deduplication (unique (value, type) pairs count toward the cap, not every occurrence), or narrow column selection to target only columns most likely to contain the patterns you need.

Single File Per Session

Pattern Extraction scans one file at a time. No batch processing across multiple files in a single operation.

Workaround:
Process files sequentially. For high-volume batch scanning (50+ files), use Python re in a loop or a shell script with grep -oP across all files.
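A sequential batch loop in Python is only a few lines. A minimal sketch (the email regex is an illustrative placeholder, and `errors="replace"` papers over encoding issues a production script should handle properly):

```python
import glob
import re

EMAIL_RX = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def scan_files(glob_pattern: str):
    """Scan every matching file sequentially; one pass per file."""
    results = {}
    for path in sorted(glob.glob(glob_pattern)):
        with open(path, encoding="utf-8", errors="replace") as fh:
            results[path] = EMAIL_RX.findall(fh.read())
    return results

# e.g. scan_files("exports/*.csv")
```

Note this reads each file whole; for very large individual files, combine it with the chunked-reading approach.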

When to Use Server-Side Entity Extraction (Python re / AWS Comprehend / spaCy) Instead

You need pattern extraction in an automated pipeline or scheduled job

SplitForge has no API, and the browser-only workflow cannot run on a schedule or be triggered programmatically.

💡 Python re + pandas in a cron job, or AWS Comprehend for entity recognition in a Lambda function.

You need custom regex patterns for proprietary data schemas

SplitForge supports only the 8 built-in types. Custom schemas require user-defined regex.

💡 Python re.findall() with your pattern, or grep -oP for shell-based extraction.

You need to scan 50M+ row files

Browser memory limits the practical ceiling to ~5–10M rows. Server-side tools scale horizontally.

💡 Python re + pandas with chunking (chunksize parameter), or AWS Comprehend for large-scale entity extraction.
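The `chunksize` approach keeps memory flat regardless of file size. A minimal sketch (column name and email regex are illustrative placeholders):

```python
import re
import pandas as pd

EMAIL_RX = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def scan_large_csv(path: str, column: str, chunk_rows: int = 1_000_000):
    """Stream a huge CSV in fixed-size chunks so the whole file
    never sits in memory at once."""
    found = []
    for chunk in pd.read_csv(path, usecols=[column], chunksize=chunk_rows):
        for cell in chunk[column].dropna().astype(str):
            found.extend(EMAIL_RX.findall(cell))
    return found

# e.g. scan_large_csv("export_50m.csv", "notes")
```

Restricting the read to one column via `usecols` further cuts memory and parse time when patterns live in known fields.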

You need named entity recognition beyond structured patterns

SplitForge extracts structured patterns (email, phone, etc.). Extracting person names, organization names, or location references requires NER models.

💡 spaCy or AWS Comprehend for NER-based extraction. Combine with SplitForge for structured pattern extraction as a pre-processing step.

Questions about limitations? Check our FAQ section below or contact us via the feedback button.

Frequently Asked Questions

How accurate is the 111K rows/second benchmark?

Why does throughput vary so much by validation mode?

What is the difference between CSV streaming mode and text paste mode?

How does validation mode affect result quality vs. speed?

What is the deduplication overhead?

Can I reproduce these benchmarks?

What file formats are supported and does format affect speed?

What happens when the 1M pattern result cap is reached?

Benchmarks last updated: February 2026. Re-tested quarterly and after major algorithm changes. Last algorithm update: February 2026 — improved IANA TLD validation set and strict mode performance by ~8%.

Ready to Extract 551K Patterns in 45 Seconds?

No installation. File contents never uploaded. 8 validated pattern types, post-extraction normalization, and a results preview — all in your browser.