Data Cleaner Performance Benchmarks
Smart Clean All (trim whitespace + remove empty rows/columns + deduplication) processes 10 million rows in 23 seconds (~435K rows/sec). Here's the full methodology, per-operation breakdown, and what affects speed on your machine.
Test Configuration
Hardware & Software
| CPU | Intel Core i7-12700K (12-core, 3.6GHz base / 5.0GHz boost) |
| RAM | 32GB DDR4-3200 (dual-channel) |
| Storage | Samsung 970 EVO NVMe SSD (read: 3,500 MB/s) |
| OS | Windows 11 Pro (22H2) |
| Browser | Chrome 131 (stable), single tab, extensions disabled |
| DevTools | Closed during all tests (no observer overhead) |
Test File Specifications
| Row count | 100K · 1M · 10M (three separate files) |
| Columns | 15 columns (mix of text, numeric, date, email) |
| Data type split | 40% text, 30% numeric, 20% date, 10% email |
| File size | 100K rows: ~8MB · 1M rows: ~82MB · 10M rows: ~820MB |
| Encoding | UTF-8, comma-delimited, CRLF line endings |
| Duplicates | ~8% duplicate rows injected for dedup tests |
| NBSP injected | ~5% of text cells contain non-breaking spaces |
| Empty rows/cols | ~3% empty rows, 2 fully empty columns injected |
Timing was captured via performance.now() inside the Web Worker.
What the Results Look Like
Operation Benchmarks — 10M Row File
🧹 Cleaning Operations (10M rows)
🔎 Filter Operations (10M rows)
Scalability Across File Sizes
Data Cleaner vs Alternatives (10M Rows, Smart Clean All)
Context matters: these numbers show how long alternatives take for a comparable "full clean" pass on a 10M row file.
Smart Clean All: Operation Overhead Breakdown (10M rows)
Time Breakdown (Total: 23.0 seconds)
Worker Architecture Details
Web Worker setup: Data Cleaner uses a dedicated Web Worker (dataCleanerWorker.worker.js) that loads PapaParse 5.3.2 via CDN importScripts and 5 modular operation handlers (parse, clean, filter, detectColumnTypes, export).
Message protocol: The UI sends an {id, operation, payload} envelope. The Worker routes it to a handler via an operation registry. Each handler posts progress messages every 50,000 rows and a final {type:'complete', result} message.
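The envelope-and-registry pattern described above can be sketched as follows. The handler name, registry shape, and handler body here are illustrative stand-ins, not the tool's actual source; in the real worker the routing would hang off self.onmessage.

```javascript
// Minimal sketch of the {id, operation, payload} envelope and the
// operation registry described above. Illustrative only.
const PROGRESS_INTERVAL = 50000;

const handlers = {
  // Stand-in "clean" handler: trims each row and reports progress
  // every PROGRESS_INTERVAL rows, as the text describes.
  clean(payload, report) {
    const out = [];
    for (let i = 0; i < payload.rows.length; i++) {
      out.push(payload.rows[i].trim());
      if ((i + 1) % PROGRESS_INTERVAL === 0) report(i + 1);
    }
    return out;
  },
};

// In the real worker this would be `self.onmessage` and
// `self.postMessage`; modeled as a plain function here so the
// routing logic is visible and testable.
function route({ id, operation, payload }, postMessage) {
  const handler = handlers[operation];
  if (!handler) {
    postMessage({ id, type: 'error', error: `unknown operation: ${operation}` });
    return;
  }
  const result = handler(payload, (done) =>
    postMessage({ id, type: 'progress', done })
  );
  postMessage({ id, type: 'complete', result });
}
```

Keeping all routing behind one registry is what lets a single worker file serve parse, clean, filter, type-detection, and export requests over the same message channel.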
Memory model: PapaParse streaming parse (not bulk parse) — reads the file in chunks without loading the entire CSV string into memory first. The parsed rows array for 10M rows requires ~2–4GB RAM depending on column width. Browsers with less than 8GB available RAM may run slowly or crash on 10M row files.
Undo history: Each cleaning operation pushes the previous data array onto an undo stack. On a 10M row dataset, each undo step requires a full copy of the data (~1–2GB). Undo is limited by available RAM. Use Reset to return to the original without undo stack overhead.
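The undo-stack behavior above can be sketched like this. The class and method names are hypothetical, not the tool's source; the point is that each operation snapshots the previous array (the per-step memory cost noted above), while Reset restores the original and drops the stack entirely.

```javascript
// Sketch of the undo-stack model described above (illustrative).
class CleaningSession {
  constructor(rows) {
    this.original = rows;   // kept so Reset works without the stack
    this.rows = rows;
    this.undoStack = [];
  }
  apply(operation) {
    // Snapshot the previous data array: on a 10M-row dataset this
    // is the ~1-2GB-per-step cost mentioned above.
    this.undoStack.push(this.rows);
    this.rows = operation(this.rows);
  }
  undo() {
    if (this.undoStack.length) this.rows = this.undoStack.pop();
  }
  reset() {
    this.rows = this.original;
    this.undoStack.length = 0; // frees all snapshots for GC
  }
}
```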
Keyboard shortcuts: Ctrl+Z triggers undo via the Worker without re-parsing the file. Ctrl+S triggers the export handler which serializes data to CSV/Excel format.
When Data Cleaner Is Slower Than Expected
Low-RAM machine (under 8GB available)
Why: A 10M row dataset with 15 columns requires 2–4GB of working memory for the parsed array plus undo history. On machines with 8GB total RAM and other apps running, the OS may start swapping to disk. Symptom: progress bar stalls around 60–70%.
Regex filter with backtracking patterns
Why: Poorly formed regex like /.*(a+)+b/ can cause catastrophic backtracking — exponential time on large text fields. Safe patterns like email validation or ^prefix are fast.
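A safe regex filter compiles the pattern once and applies it per row, which is also the pre-compilation fix noted in the changelog. This is a sketch under assumed names; the function signature and field name are illustrative, not the tool's API.

```javascript
// Compile once, test per row. Re-compiling the RegExp inside the row
// loop was ~3x slower (see changelog). Pattern below is a simple,
// backtracking-safe email check; names are illustrative.
function regexFilter(rows, field, pattern) {
  const re = new RegExp(pattern); // compiled once, outside the loop
  return rows.filter(row => re.test(row[field]));
}
```

Patterns built from negated character classes (like the email check below) avoid the nested quantifiers that cause catastrophic backtracking.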
Wide files (100+ columns)
Why: Trim whitespace and case standardize iterate over all columns. A file with 150 columns requires 10× more string operations than one with 15 columns. Smart Clean All time scales roughly linearly with column count.
Safari (macOS / iOS)
Why: Safari's JavaScript engine (JavaScriptCore) has lower Web Worker throughput than Chrome's V8 on CPU-bound string operations. PapaParse streaming parse is also slower on Safari due to a different FileReader implementation.
Excel (.xlsx) export for large files
Why: Excel export uses SheetJS (xlsx library) which constructs an XML-based .xlsx file in memory. Unlike CSV export (simple string concatenation), xlsx requires building a ZIP archive of multiple XML files. For 10M rows: CSV export takes ~1.9s, Excel export takes ~72s.
Filtering with AND logic and 5+ active filters
Why: AND filter applies each filter predicate in sequence — the dataset is scanned up to N times (once per filter). With 5 filters, this is 5 full passes over the dataset. OR logic is slightly faster because early exits are possible.
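One way to avoid N full passes is to fold all AND predicates into a single filter callback, so the dataset is scanned once and each row short-circuits on the first failing predicate. This is a sketch of the technique, not the tool's implementation.

```javascript
// Single-pass AND filtering: every() short-circuits per row, so a row
// that fails the first predicate never touches the remaining ones.
function applyAnd(rows, predicates) {
  return rows.filter(row => predicates.every(p => p(row)));
}
```

With 5 active filters this does one pass with up to 5 cheap predicate calls per row, instead of 5 full scans of the dataset.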
Full Test Methodology
Test Procedure
1. Generate test CSV files using a Python script (reproducible random seed). Inject known percentages of duplicates, NBSP, and empty rows/columns.
2. Open a fresh Chrome instance, disable all extensions, close DevTools.
3. Load splitforge.app/tools/data-cleaner in a single tab.
4. Drop the test file into the tool. Wait for the "Parse complete" signal.
5. Click the operation button. Note the start time via performance.now() logged in the Worker.
6. Wait for the "complete" message. Record wall-clock time from the Worker log.
7. Repeat 10 times for each operation. Discard the highest and lowest values.
8. Average the remaining 8 values. Round to 1 decimal place.
9. Verify the result row count against the expected count (known % of duplicates/empty rows).
10. Re-test after a Chrome update if the version changes.
Reproducibility
Test file generation: Python script with fixed random seed (42) generates reproducible test files. The 10M row test CSV is available upon request — contact via the SplitForge site.
Timing precision: Times measured via performance.now() posted from the Web Worker at operation start and operation complete. Precision: sub-millisecond. Reported to nearest 0.1 second.
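The timing harness described here can be sketched as a small wrapper: performance.now() at start and completion, with the result rounded to 0.1s for reporting. The wrapper name is hypothetical; performance.now() itself is standard in both browsers and modern Node.

```javascript
// Wrap a handler so it reports elapsed wall-clock time, rounded to
// the nearest 0.1 second as in the benchmark tables. Sketch only.
function timed(fn) {
  return (...args) => {
    const start = performance.now();
    const result = fn(...args);
    const seconds = (performance.now() - start) / 1000;
    return { result, seconds: Math.round(seconds * 10) / 10 };
  };
}
```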
What's included in timing: For "Smart Clean All" — includes parse time. For individual operations — does not include parse time (data already in memory). Export times — from button click to download dialog appearing.
Machine state during tests: No other browser tabs open. No other applications using significant CPU. System idle for 30 seconds before each test session. Tests run at ambient temperature (not during thermal throttle).
Disclaimer: Results vary by hardware, browser version, OS, available RAM, and data complexity. Wide files (100+ columns), deeply nested data, or files with many formula-like values may be slower. Mobile results typically 3–5× slower than the test hardware.
Benchmark Changelog
- Added NBSP detection to Trim Whitespace — added ~0.4s overhead at 10M rows but catches 5–12% more whitespace issues
- Smart Clean All re-tested after Dedupe algorithm updated to hash-based (was sort-based in v2.2) — 14.7s → now included in Smart Clean All path at 23s total
- Added per-column case transform — no measurable performance change vs all-column transform
- Column picker for Replace Empty Values: negligible additional overhead (<0.1s)
- Upgrade to PapaParse 5.3.2 from 5.3.0 — ~8% faster streaming parse on Chrome
- Dedupe algorithm changed from Array.sort → Set-based hashing — 2.3× faster on 10M rows (was 34s, now 14.7s)
- Regex filter: pre-compile RegExp objects on filter apply (was re-compiling per row) — 3× faster regex filtering
Known Limitations
Parsing 10M rows into a JS array requires 2–4GB of browser-accessible RAM. Chrome's V8 heap limit is typically 4GB on 64-bit systems. Files above 10–15M rows (depending on column width) may cause an out-of-memory crash. Use the CSV Splitter to process in chunks.
SheetJS .xlsx generation at 10M rows takes ~72 seconds due to XML/ZIP overhead. CSV export is always faster (~1.9s). For files over 5M rows, export as CSV and convert separately.
Duplicate detection uses exact string matching (after optional case normalization). "Jon Smith" and "John Smith" are treated as different records. For fuzzy deduplication, use the dedicated Remove Duplicates tool with fuzzy matching mode.
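Exact-match, Set-based deduplication (the hash-based approach noted in the changelog) can be sketched like this. The function and option names are illustrative; rows are keyed on their joined cell values after optional case normalization, and the first occurrence wins.

```javascript
// Set-based exact dedup: O(n) over rows. A NUL-joined key prevents
// cell-boundary collisions (['ab','c'] vs ['a','bc']). Sketch only.
function dedupe(rows, { caseInsensitive = false } = {}) {
  const seen = new Set();
  return rows.filter(row => {
    let key = row.join('\u0000');
    if (caseInsensitive) key = key.toLowerCase();
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}
```

Note that "Jon Smith" and "John Smith" produce different keys either way, which is exactly why fuzzy matching needs the separate tool mentioned above.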
Mobile CPUs process string operations 3–5× slower than desktop CPUs. Safari iOS has additional Web Worker limitations. For files over 100K rows, desktop is recommended. Mobile works well for files under 50K rows.
Performance FAQs
Why is Smart Clean All slower than individual operations?
How does duplicate detection scale — is it O(n)?
How is my file read? Is it loaded into memory all at once?
What is the maximum file size Data Cleaner can handle?
Does the UI freeze while cleaning large files?
See These Speeds on Your Own Files
Drop your messy CSV into Data Cleaner and run Smart Clean All. No signup, no upload, no wait.