Data Cleaner Performance Benchmarks
Smart Clean All (trim whitespace + remove empty rows/columns + deduplication) processes 10 million rows in 23 seconds (~435K rows/sec). Here's the full methodology, per-operation breakdown, and what affects speed on your machine.
Test Configuration
Hardware & Software
| CPU | Intel Core i7-12700K (12-core, 3.6GHz base / 5.0GHz boost) |
| RAM | 32GB DDR4-3200 (dual-channel) |
| Storage | Samsung 970 EVO NVMe SSD (read: 3,500 MB/s) |
| OS | Windows 11 Pro (22H2) |
| Browser | Chrome 131 (stable), single tab, extensions disabled |
| DevTools | Closed during all tests (no observer overhead) |
Test File Specifications
| Row count | 100K · 1M · 10M (three separate files) |
| Columns | 15 columns (mix of text, numeric, date, email) |
| Data type split | 40% text, 30% numeric, 20% date, 10% email |
| File size | 100K rows: ~8MB · 1M rows: ~82MB · 10M rows: ~820MB |
| Encoding | UTF-8, comma-delimited, CRLF line endings |
| Duplicates | ~8% duplicate rows injected for dedup tests |
| NBSP injected | ~5% of text cells contain non-breaking spaces |
| Empty rows/cols | ~3% empty rows, 2 fully empty columns injected |
Timing was captured via performance.now() inside the Web Worker.
What the Results Look Like
Operation Benchmarks — 10M Row File
🧹 Cleaning Operations (10M rows)
🔎 Filter Operations (10M rows)
Scalability Across File Sizes
Data Cleaner vs Alternatives (10M Rows, Smart Clean All)
Context matters: these numbers show how long alternatives take for a comparable "full clean" pass on a 10M row file.
Smart Clean All: Operation Overhead Breakdown (10M rows)
Time Breakdown (Total: 23.0 seconds)
Worker Architecture Details
Web Worker setup: Data Cleaner uses a dedicated Web Worker (dataCleanerWorker.worker.js) that loads PapaParse 5.3.2 via CDN importScripts and 5 modular operation handlers (parse, clean, filter, detectColumnTypes, export).
Message protocol: The UI sends an {id, operation, payload} envelope. The Worker routes it to a handler via an operation registry. Each handler posts progress messages every 50,000 rows and a final {type:'complete', result} message.
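The envelope-and-registry pattern described above can be sketched as follows. The handler name, registry shape, and handler body here are illustrative stand-ins, not the tool's actual source; in the real worker the routing would hang off self.onmessage.

```javascript
// Minimal sketch of the {id, operation, payload} envelope and the
// operation registry described above. Illustrative only.
const PROGRESS_INTERVAL = 50000;

const handlers = {
  // Stand-in "clean" handler: trims each row and reports progress
  // every PROGRESS_INTERVAL rows, as the text describes.
  clean(payload, report) {
    const out = [];
    for (let i = 0; i < payload.rows.length; i++) {
      out.push(payload.rows[i].trim());
      if ((i + 1) % PROGRESS_INTERVAL === 0) report(i + 1);
    }
    return out;
  },
};

// In the real worker this would be `self.onmessage` and
// `self.postMessage`; modeled as a plain function here so the
// routing logic is visible and testable.
function route({ id, operation, payload }, postMessage) {
  const handler = handlers[operation];
  if (!handler) {
    postMessage({ id, type: 'error', error: `unknown operation: ${operation}` });
    return;
  }
  const result = handler(payload, (done) =>
    postMessage({ id, type: 'progress', done })
  );
  postMessage({ id, type: 'complete', result });
}
```

Keeping all routing behind one registry is what lets a single worker file serve parse, clean, filter, type-detection, and export requests over the same message channel.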
Memory model: PapaParse streaming parse (not bulk parse) — reads the file in chunks without loading the entire CSV string into memory first. The parsed rows array for 10M rows requires ~2–4GB RAM depending on column width. Browsers with less than 8GB available RAM may run slowly or crash on 10M row files.
Undo history: Each cleaning operation pushes the previous data array onto an undo stack. On a 10M row dataset, each undo step requires a full copy of the data (~1–2GB). Undo is limited by available RAM. Use Reset to return to the original without undo stack overhead.
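The undo-stack behavior above can be sketched like this. The class and method names are hypothetical, not the tool's source; the point is that each operation snapshots the previous array (the per-step memory cost noted above), while Reset restores the original and drops the stack entirely.

```javascript
// Sketch of the undo-stack model described above (illustrative).
class CleaningSession {
  constructor(rows) {
    this.original = rows;   // kept so Reset works without the stack
    this.rows = rows;
    this.undoStack = [];
  }
  apply(operation) {
    // Snapshot the previous data array: on a 10M-row dataset this
    // is the ~1-2GB-per-step cost mentioned above.
    this.undoStack.push(this.rows);
    this.rows = operation(this.rows);
  }
  undo() {
    if (this.undoStack.length) this.rows = this.undoStack.pop();
  }
  reset() {
    this.rows = this.original;
    this.undoStack.length = 0; // frees all snapshots for GC
  }
}
```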
Keyboard shortcuts: Ctrl+Z triggers undo via the Worker without re-parsing the file. Ctrl+S triggers the export handler which serializes data to CSV/Excel format.
When Data Cleaner Is Slower Than Expected
Low-RAM machine (under 8GB available)
Why: A 10M row dataset with 15 columns requires 2–4GB of working memory for the parsed array plus undo history. On machines with 8GB total RAM and other apps running, the OS may start swapping to disk. Symptom: progress bar stalls around 60–70%.
Regex filter with backtracking patterns
Why: Poorly formed regex like /.*(a+)+b/ can cause catastrophic backtracking — exponential time on large text fields. Safe patterns like email validation or ^prefix are fast.
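A safe regex filter compiles the pattern once and applies it per row, which is also the pre-compilation fix noted in the changelog. This is a sketch under assumed names; the function signature and field name are illustrative, not the tool's API.

```javascript
// Compile once, test per row. Re-compiling the RegExp inside the row
// loop was ~3x slower (see changelog). Pattern below is a simple,
// backtracking-safe email check; names are illustrative.
function regexFilter(rows, field, pattern) {
  const re = new RegExp(pattern); // compiled once, outside the loop
  return rows.filter(row => re.test(row[field]));
}
```

Patterns built from negated character classes (like the email check below) avoid the nested quantifiers that cause catastrophic backtracking.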
Wide files (100+ columns)
Why: Trim whitespace and case standardize iterate over all columns. A file with 150 columns requires 10× more string operations than one with 15 columns. Smart Clean All time scales roughly linearly with column count.
Safari (macOS / iOS)
Why: Safari's JavaScript engine (JavaScriptCore) has lower Web Worker throughput than Chrome's V8 on CPU-bound string operations. PapaParse streaming parse is also slower on Safari due to a different FileReader implementation.
Excel (.xlsx) export for large files
Why: Excel export uses SheetJS (xlsx library) which constructs an XML-based .xlsx file in memory. Unlike CSV export (simple string concatenation), xlsx requires building a ZIP archive of multiple XML files. For 10M rows: CSV export takes ~1.9s, Excel export takes ~72s.
Filtering with AND logic and 5+ active filters
Why: AND filter applies each filter predicate in sequence — the dataset is scanned up to N times (once per filter). With 5 filters, this is 5 full passes over the dataset. OR logic is slightly faster because early exits are possible.
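One way to avoid N full passes is to fold all AND predicates into a single filter callback, so the dataset is scanned once and each row short-circuits on the first failing predicate. This is a sketch of the technique, not the tool's implementation.

```javascript
// Single-pass AND filtering: every() short-circuits per row, so a row
// that fails the first predicate never touches the remaining ones.
function applyAnd(rows, predicates) {
  return rows.filter(row => predicates.every(p => p(row)));
}
```

With 5 active filters this does one pass with up to 5 cheap predicate calls per row, instead of 5 full scans of the dataset.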
Full Test Methodology
Test Procedure
1. Generate test CSV files using a Python script (reproducible random seed). Inject known percentages of duplicates, NBSP, and empty rows/columns.
2. Open a fresh Chrome instance, disable all extensions, close DevTools.
3. Load splitforge.app/tools/data-cleaner in a single tab.
4. Drop the test file into the tool. Wait for the "Parse complete" signal.
5. Click the operation button. Note the start time via performance.now() logged in the Worker.
6. Wait for the "complete" message. Record wall-clock time from the Worker log.
7. Repeat 10 times for each operation. Discard the highest and lowest values.
8. Average the remaining 8 values. Round to 1 decimal place.
9. Verify the result row count against the expected count (known % of duplicates/empty rows).
10. Re-test after a Chrome update if the version changes.
Reproducibility
Test file generation: Python script with fixed random seed (42) generates reproducible test files. The 10M row test CSV is available upon request — contact via the SplitForge site.
Timing precision: Times measured via performance.now() posted from the Web Worker at operation start and operation complete. Precision: sub-millisecond. Reported to nearest 0.1 second.
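The timing harness described here can be sketched as a small wrapper: performance.now() at start and completion, with the result rounded to 0.1s for reporting. The wrapper name is hypothetical; performance.now() itself is standard in both browsers and modern Node.

```javascript
// Wrap a handler so it reports elapsed wall-clock time, rounded to
// the nearest 0.1 second as in the benchmark tables. Sketch only.
function timed(fn) {
  return (...args) => {
    const start = performance.now();
    const result = fn(...args);
    const seconds = (performance.now() - start) / 1000;
    return { result, seconds: Math.round(seconds * 10) / 10 };
  };
}
```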
What's included in timing: For "Smart Clean All" — includes parse time. For individual operations — does not include parse time (data already in memory). Export times — from button click to download dialog appearing.
Machine state during tests: No other browser tabs open. No other applications using significant CPU. System idle for 30 seconds before each test session. Tests run at ambient temperature (not during thermal throttle).
Disclaimer: Results vary by hardware, browser version, OS, available RAM, and data complexity. Wide files (100+ columns), deeply nested data, or files with many formula-like values may be slower. Mobile results typically 3–5× slower than the test hardware.
Benchmark Changelog
- Added NBSP detection to Trim Whitespace — added ~0.4s overhead at 10M rows but catches 5–12% more whitespace issues
- Smart Clean All re-tested after Dedupe algorithm updated to hash-based (was sort-based in v2.2) — 14.7s → now included in Smart Clean All path at 23s total
- Added per-column case transform — no measurable performance change vs all-column transform
- Column picker for Replace Empty Values: negligible additional overhead (<0.1s)
- Upgrade to PapaParse 5.3.2 from 5.3.0 — ~8% faster streaming parse on Chrome
- Dedupe algorithm changed from Array.sort → Set-based hashing — 2.3× faster on 10M rows (was 34s, now 14.7s)
- Regex filter: pre-compile RegExp objects on filter apply (was re-compiling per row) — 3× faster regex filtering
Known Limitations
Parsing 10M rows into a JS array requires 2–4GB of browser-accessible RAM. Chrome's V8 heap limit is typically 4GB on 64-bit systems. Files above 10–15M rows (depending on column width) may cause an out-of-memory crash. Use the CSV Splitter to process in chunks.
SheetJS .xlsx generation at 10M rows takes ~72 seconds due to XML/ZIP overhead. CSV export is always faster (~1.9s). For files over 5M rows, export as CSV and convert separately.
Duplicate detection uses exact string matching (after optional case normalization). "Jon Smith" and "John Smith" are treated as different records. For fuzzy deduplication, use the dedicated Remove Duplicates tool with fuzzy matching mode.
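Exact-match, Set-based deduplication (the hash-based approach noted in the changelog) can be sketched like this. The function and option names are illustrative; rows are keyed on their joined cell values after optional case normalization, and the first occurrence wins.

```javascript
// Set-based exact dedup: O(n) over rows. A NUL-joined key prevents
// cell-boundary collisions (['ab','c'] vs ['a','bc']). Sketch only.
function dedupe(rows, { caseInsensitive = false } = {}) {
  const seen = new Set();
  return rows.filter(row => {
    let key = row.join('\u0000');
    if (caseInsensitive) key = key.toLowerCase();
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}
```

Note that "Jon Smith" and "John Smith" produce different keys either way, which is exactly why fuzzy matching needs the separate tool mentioned above.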
Mobile CPUs process string operations 3–5× slower than desktop CPUs. Safari iOS has additional Web Worker limitations. For files over 100K rows, desktop is recommended. Mobile works well for files under 50K rows.
Performance FAQs
Why is Smart Clean All slower than individual operations?
How does duplicate detection scale — is it O(n)?
How is my file read? Is it loaded into memory all at once?
What is the maximum file size Data Cleaner can handle?
Does the UI freeze while cleaning large files?
See These Speeds on Your Own Files
Drop your messy CSV into Data Cleaner and run Smart Clean All. No signup, no upload, no wait.