Excel stops at 1,048,576 rows. Most online CSV tools crash around 500,000 rows. Python can handle millions—but requires installation, dependencies, and coding knowledge.
If your team works with CRM exports, email lists, product catalogs, transaction logs, or analytics dumps, this benchmark tells you whether your deduplication workflow belongs in Excel, Python… or the browser.
At SplitForge, we've already proven browsers can handle 15 million rows for CSV merging. For Remove Duplicates, we went further: 10GB+ files at 218–277K rows/second, with a worker heap that stays near-constant at ~100MB regardless of input size.
The architecture change that made it possible—OPFS external sort—is what this post covers.
Updated June 2026: Numbers in this post reflect the OPFS 3-pass algorithm shipped in the current Remove Duplicates worker. Previous versions of this post cited a 435K rows/sec inline hash benchmark (February 2026). That number was accurate for files below 100K rows, but it hid the memory ceiling that larger real-world files run into. The OPFS path trades some single-file throughput for scale that is no longer capped by the JavaScript heap. The performance page has full detail.
TL;DR
- 10M rows deduped in ~40 seconds (218–277K rows/sec, OPFS path)
- 10GB+ validated — 200M rows in ~14 minutes with ~100MB worker heap
- Fully client-side (zero uploads)
- Outperforms Excel (hard 1M row cap) and matches Python pandas at scale
- Verified benchmark, reproducible yourself
Try it: splitforge.app/tools/remove-duplicates
Full benchmark: splitforge.app/remove-duplicates-performance
Table of Contents
- The Benchmark: OPFS at 10GB Scale
- Why This Benchmark Matters
- The OPFS 3-Pass Architecture
- Technical Architecture Deep Dive
- How This Compares to Alternatives
- Real-World Use Cases
- The Privacy Advantage
- Try It Yourself
- FAQ
The Benchmark: OPFS at 10GB Scale
We deduplicated CSV files at 218–277K rows/second in a browser—with a worker heap that stays near-constant at ~100MB regardless of input size. This benchmark proves that client-side processing can handle enterprise-scale data without memory trade-offs, and without any uploads.
We tested SplitForge's Remove Duplicates tool against files ranging from 100K rows to 200M rows (10GB+).
The results:
- 10,000,000 rows processed with full duplicate detection in ~40 seconds
- 218–277K rows/second sustained OPFS throughput (browser Gate 3, Chrome stable, June 2026)
- 10GB+ validated — 200M rows in ~14 minutes, ~100MB worker heap throughout
- Zero uploads — all processing happens client-side
- 100% privacy — your data never leaves your device
This isn't a marketing claim. It's a verified benchmark you can reproduce yourself.
Full Benchmark Data
| Rows | Approx. File Size | Time | Throughput | Path | Status |
|---|---|---|---|---|---|
| 1,000 | <1MB | <0.1s | — | Inline hash | ✓ Pass |
| 100,000 | ~10MB | ~0.3s | ~333K/s | Inline hash | ✓ Pass |
| 1,000,000 | ~100MB | ~4s | ~250K/s | OPFS 3-pass | ✓ Pass |
| 5,000,000 | ~500MB | ~20s | ~250K/s | OPFS 3-pass | ✓ Pass |
| 10,000,000 | ~1GB | ~40s | ~248K/s | OPFS 3-pass | ✓ Pass |
| 28,000,000 | 1.9GB | 1.6 min | ~288K/s | OPFS (Node harness) | ✓ Pass |
| 200,000,000+ | ~10GB | ~14 min | 218–277K/s | OPFS (browser Gate 3) | ✓ Pass |
Test environment (browser): Chrome (stable), Windows 11, Intel i5-12600KF (3.70GHz), 64GB RAM, June 2026 Gate 3 validation. Node.js harness: same algorithm, Node v24.14, same hardware — 288K rows/sec at 28M rows, June 2026.
Below 100K rows the inline hash path runs at higher throughput (~333K+/s) — the OPFS overhead is not justified at that scale and the tool switches automatically.
View the full benchmark page with architecture details and memory profiles.
Why This Benchmark Matters
Excel Can't Load the Whole File
Microsoft Excel has a hard limit: 1,048,576 rows maximum per worksheet. According to Microsoft's Excel specifications, this is a worksheet limitation that cannot be exceeded. If your CSV has 1,048,577 rows, Excel loads only the first 1,048,576 and warns that the file was not loaded completely.
Our 10M row test file had nearly 10× Excel's capacity. Our 10GB validation run had over 190× Excel's capacity.
For data analysts, marketers, and finance teams working with large datasets, this limit isn't theoretical. It's a daily roadblock. CRM exports, email lists, transaction logs, and analytics dumps routinely exceed Excel's row limit. For a comprehensive guide to Excel's row limit, see our Excel row limit explained guide.
Online Tools Fail at Scale
We tested competitors:
- Most online CSV tools crash between 100K–500K rows
- Those that handle larger files require uploads (privacy risk)
- Server-side processing means waiting for queues
- Many impose file size limits (50MB, 100MB caps)
SplitForge processes 10GB+ locally, with no upload, no server queue, and a bounded worker heap.
Python Works—But Requires Setup
Yes, Python's pandas library can deduplicate 10 million rows. But:
- You need Python installed
- You need to install pandas (`pip install pandas`)
- You need to write code (not everyone can)
- `df.drop_duplicates()` loads the entire file into RAM, which causes memory issues above ~4GB
- Processing takes 30–60 seconds on equivalent hardware for 10M rows
SplitForge requires:
- No installation
- No coding
- No command line
- Just drag, drop, click
And it handles files that would trigger a MemoryError in pandas on most machines.
The OPFS 3-Pass Architecture
Building a browser tool that processes 10GB without running out of memory required replacing the inline hash approach with an external sort algorithm backed by the Origin Private File System (OPFS).
The key insight: a browser's JavaScript heap is limited. A 10M-row Set of hashes stays manageable, but at 100M+ rows it grows to gigabytes. The OPFS path removes the heap ceiling entirely.
The algorithm runs in three passes:
Input CSV
│
▼ Pass 1 — Sort Chunks
Stream input → parse rows → build sort keys
Flush sorted 50MB chunks to OPFS storage
(Each chunk: sort(keys), write to file, clear memory)
│
▼ Pass 2 — K-Way Merge
Open all chunk files simultaneously via MinHeap
Batch-reduce if chunks > 32 (multiple merge levels)
Emit winner row index per duplicate group
Build compact bitset of survivors
│
▼ Pass 3 — Survivor Scan
Re-stream input
Check bitset for each row
Emit survivors → output CSV
Why this approach scales to 10GB:
- Pass 1: Sorts data in bounded ~50MB memory windows. Total chunks ≈ input size / 50MB. Heap usage stays near-constant.
- Pass 2: The k-way merge streams from OPFS instead of holding data in memory: 32 concurrent file readers plus one MinHeap. The winner bitset is tiny: ceil(rows/8) bytes (≈25MB for 200M rows).
- Pass 3: Pure streaming scan — no row accumulation.
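The three passes can be sketched on a small in-memory input. This is a simplified model, not the shipped worker: plain arrays stand in for OPFS chunk files, the full row text stands in for the sort key, a linear scan replaces the MinHeap, and keeping the first occurrence of each duplicate group is an assumption. The function name `dedupeThreePass` and the `chunkSize` parameter are hypothetical.

```javascript
// Hypothetical in-memory model of the three passes. Arrays stand in for
// OPFS chunk files, and a row's full text stands in for its sort key.
function dedupeThreePass(rows, chunkSize = 4) {
  // Pass 1: cut the input into bounded chunks and sort each one by key.
  const chunks = [];
  for (let start = 0; start < rows.length; start += chunkSize) {
    const chunk = [];
    const end = Math.min(start + chunkSize, rows.length);
    for (let i = start; i < end; i++) {
      chunk.push({ key: rows[i], index: i });
    }
    chunk.sort((a, b) =>
      a.key < b.key ? -1 : a.key > b.key ? 1 : a.index - b.index);
    chunks.push(chunk);
  }

  // Pass 2: merge the sorted chunks. The first entry seen for each key
  // (the lowest row index) wins and is flagged in a 1-bit-per-row bitset.
  const survivors = new Uint8Array(Math.ceil(rows.length / 8));
  const cursors = chunks.map(() => 0);
  let lastKey = null;
  for (;;) {
    // Linear scan for the smallest head; a MinHeap would replace this.
    let best = -1;
    for (let c = 0; c < chunks.length; c++) {
      if (cursors[c] >= chunks[c].length) continue;
      if (best === -1) { best = c; continue; }
      const head = chunks[c][cursors[c]];
      const cur = chunks[best][cursors[best]];
      if (head.key < cur.key ||
          (head.key === cur.key && head.index < cur.index)) {
        best = c;
      }
    }
    if (best === -1) break; // every chunk is exhausted
    const entry = chunks[best][cursors[best]++];
    if (entry.key !== lastKey) {
      survivors[entry.index >> 3] |= 1 << (entry.index & 7);
      lastKey = entry.key;
    }
  }

  // Pass 3: re-scan the input in original order and keep flagged rows.
  return rows.filter((_, i) => (survivors[i >> 3] & (1 << (i & 7))) !== 0);
}
```

Because Pass 3 walks the input in its original order, the output preserves row order even though Pass 2 visits rows in sorted order.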
Worker heap profile (10GB run, Chrome DevTools):
- Pass 1: ~100MB peak (one in-flight sort chunk)
- Pass 2: ~34MB (28MB bitset + OPFS I/O buffers)
- Pass 3: ~20MB (streaming only)
No OOM. No browser crash. No file size ceiling beyond available disk.
What Changed from the Old Architecture
The previous version (February 2026) used a simpler inline hash approach:

    // Old inline hash path (still used below 100K rows)
    const seenHashes = new Set(); // one hash per unique row seen so far
    for await (const line of streamCSVLines(file)) {
      const hash = computeFNV1a(line);
      if (!seenHashes.has(hash)) {
        seenHashes.add(hash);
        writeToOutput(line);
      }
    }

This is still the fastest path and runs for files under 100K rows. It's O(n) and does a single pass. But the seenHashes Set grows with every unique row — at 10M rows it occupies ~40–80MB. At 100M+ rows it exceeds safe V8 heap limits.
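For readers unfamiliar with FNV-1a, here is a minimal sketch of the hash-and-Set idea. The post does not state which FNV-1a width `computeFNV1a` uses, so the 32-bit variant below is an assumption, and `fnv1a32` and `dedupeLines` are hypothetical names. A 32-bit hash can collide on large inputs, so a production path would likely need a wider hash or a collision check.

```javascript
// 32-bit FNV-1a over the UTF-8 bytes of a string (assumed variant).
function fnv1a32(text) {
  const bytes = new TextEncoder().encode(text);
  let hash = 0x811c9dc5; // FNV offset basis
  for (const byte of bytes) {
    hash ^= byte;
    hash = Math.imul(hash, 0x01000193); // multiply by the FNV prime, mod 2^32
  }
  return hash >>> 0; // force an unsigned 32-bit result
}

// Single-pass dedupe: keep a line only if its hash has not been seen.
function dedupeLines(lines) {
  const seen = new Set();
  const kept = [];
  for (const line of lines) {
    const hash = fnv1a32(line);
    if (!seen.has(hash)) {
      seen.add(hash);
      kept.push(line);
    }
  }
  return kept;
}
```

The Set stores one number per unique row, which is why its size tracks the count of unique rows and not the file size.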
The OPFS path is ~25–30% slower for 10M rows (248K vs ~350K rows/sec) but removes the memory ceiling entirely. For 10GB files, the inline path would OOM; the OPFS path completes in ~14 minutes.
The tool selects automatically:
- Under 100K rows → inline hash (fastest, minimal overhead)
- Over 100K rows → OPFS 3-pass (bounded memory, unlimited scale)
Technical Architecture Deep Dive
Memory Management Strategy
OPFS as swap space: Sort chunks are written to the Origin Private File System, a sandboxed file storage API whose synchronous access handles are available in dedicated Web Workers. OPFS data lives on disk, not in the JavaScript heap, so it is limited by the browser's storage quota and not by V8 heap limits. Writing 50MB chunks to OPFS behaves like writing to local disk: no GC pressure, no V8 heap growth.
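As a rough illustration of what flushing one chunk to OPFS can look like, the sketch below uses the standard `getFileHandle` and `createSyncAccessHandle` calls. The function name `writeChunk` and the `chunk-N.bin` naming scheme are hypothetical, and the directory handle is passed in as a parameter so the function can be exercised outside a browser.

```javascript
// Sketch: write one sorted chunk to OPFS from inside a dedicated worker.
// `root` is the directory returned by `await navigator.storage.getDirectory()`.
// The file naming scheme is an assumption made for this example.
async function writeChunk(root, chunkIndex, bytes) {
  const fileHandle = await root.getFileHandle(`chunk-${chunkIndex}.bin`, { create: true });
  // Sync access handles are only available in dedicated workers.
  const access = await fileHandle.createSyncAccessHandle();
  try {
    access.truncate(0);                             // drop any stale contents
    const written = access.write(bytes, { at: 0 }); // returns bytes written
    access.flush();                                 // push buffered data to disk
    return written;
  } finally {
    access.close(); // releases the exclusive lock on the file
  }
}
```

Inside a worker this would be called as `writeChunk(await navigator.storage.getDirectory(), 0, chunkBytes)`. Before a very large run, `navigator.storage.estimate()` reports the current usage and quota for the origin.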
Bitset compression: The survivor index uses 1 bit per input row. For 200M rows: ceil(200M/8) = 25MB. That is far smaller than storing row indices, hashes, or copies of row data; a 4-byte index per row would take 800MB for the same input.
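A 1-bit-per-row survivor index fits in a few lines on top of `Uint8Array`. The class below is an illustrative sketch; the name `SurvivorBitset` and its methods are hypothetical and not taken from the SplitForge worker.

```javascript
// One bit per input row: bit (row & 7) of byte (row >> 3).
class SurvivorBitset {
  constructor(rowCount) {
    this.bytes = new Uint8Array(Math.ceil(rowCount / 8));
  }
  // Flag a row as a survivor.
  mark(row) {
    this.bytes[row >> 3] |= 1 << (row & 7);
  }
  // Check whether a row was flagged.
  has(row) {
    return (this.bytes[row >> 3] & (1 << (row & 7))) !== 0;
  }
}
```

For 200,000,000 rows the backing array is exactly 25,000,000 bytes, which matches the ceil(rows/8) figure.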
Chunked output writing: Survivors are emitted in streaming passes with buffered writes. Output never accumulates in heap memory.
K-Way Heap Merge (Pass 2)
The MinHeap merge handles up to 32 concurrent chunk streams simultaneously. For inputs generating more than 32 chunks (any file above ~1.6GB), a batch-reduce pass runs first: batches of 32 chunks are merged into intermediate files, then the intermediate files are merged in the final pass.
For the 10GB benchmark: ~240 chunks → 8 batches of up to 32 → 8 intermediates → final 8-way merge. Merging a full batch of 32 chunks into one intermediate reduces the total chunk count by 31. Total I/O: approximately 3× the input file size across all three passes.
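The merge and the batch-reduce arithmetic can be sketched as follows. This is a generic MinHeap k-way merge over in-memory sorted arrays, which stand in for chunk files; `MinHeap`, `kWayMerge`, and `mergeLevels` are hypothetical names, and the real worker's record format is not described in this post.

```javascript
// Minimal binary min-heap ordered by a comparator function.
class MinHeap {
  constructor(compare) { this.items = []; this.compare = compare; }
  get size() { return this.items.length; }
  push(item) {
    const a = this.items;
    a.push(item);
    let i = a.length - 1;
    while (i > 0) { // sift up
      const parent = (i - 1) >> 1;
      if (this.compare(a[i], a[parent]) >= 0) break;
      [a[i], a[parent]] = [a[parent], a[i]];
      i = parent;
    }
  }
  pop() {
    const a = this.items;
    const top = a[0];
    const last = a.pop();
    if (a.length > 0) {
      a[0] = last;
      let i = 0;
      for (;;) { // sift down
        const l = 2 * i + 1;
        const r = l + 1;
        let m = i;
        if (l < a.length && this.compare(a[l], a[m]) < 0) m = l;
        if (r < a.length && this.compare(a[r], a[m]) < 0) m = r;
        if (m === i) break;
        [a[i], a[m]] = [a[m], a[i]];
        i = m;
      }
    }
    return top;
  }
}

// Merge sorted arrays (stand-ins for chunk files) into one sorted array.
function kWayMerge(sortedChunks) {
  const heap = new MinHeap((x, y) => (x.value < y.value ? -1 : x.value > y.value ? 1 : 0));
  sortedChunks.forEach((chunk, c) => {
    if (chunk.length > 0) heap.push({ value: chunk[0], chunk: c, pos: 0 });
  });
  const out = [];
  while (heap.size > 0) {
    const { value, chunk, pos } = heap.pop();
    out.push(value);
    if (pos + 1 < sortedChunks[chunk].length) {
      heap.push({ value: sortedChunks[chunk][pos + 1], chunk, pos: pos + 1 });
    }
  }
  return out;
}

// File counts after each batch-reduce level until one final merge suffices.
function mergeLevels(chunkCount, fanIn = 32) {
  const levels = [chunkCount];
  while (levels[levels.length - 1] > fanIn) {
    levels.push(Math.ceil(levels[levels.length - 1] / fanIn));
  }
  return levels;
}
```

With a fan-in of 32, `mergeLevels(240)` yields 240 chunks and then 8 intermediates, after which a single 8-way merge finishes the job. The heap never holds more than one entry per open chunk, so its size depends on the fan-in and not on the row count.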
Web Workers for Non-Blocking Processing
All three passes run inside a Web Worker:

    // Main thread
    const worker = new Worker('deduperWorker.js');
    worker.postMessage({ file, options });
    worker.onmessage = ({ data }) => updateProgress(data);

    // Worker thread
    self.onmessage = async ({ data }) => {
      await dedupOPFS(data.file, data.options);
      self.postMessage({ type: 'complete' });
    };

The browser stays responsive throughout. Users can switch tabs, scroll, and interact with the page even during a 14-minute 10GB run.
Performance Profiling Results
Benchmark breakdown (10M rows, ~1GB file, OPFS 3-pass path):
- Pass 1 — sort chunks: ~18s (45% of total)
- Pass 2 — k-way merge: ~8s (20% of total)
- Pass 3 — survivor scan: ~14s (35% of total)
- Total: ~40s
Bottleneck: disk I/O. The input is read in full twice (once in Pass 1, again in Pass 3), and the sort chunks are written in Pass 1 and read back in Pass 2. Passes 1 and 3 are I/O-bound on typical laptop SSDs. Pass 2 is CPU-bound (MinHeap comparisons) but runs on the already-small chunk files.
How This Compares to Alternatives
Here's how SplitForge stacks up against every alternative:
| Tool | Max Rows | Performance | Privacy | Setup Required |
|---|---|---|---|---|
| Excel | 1,048,576 (hard limit) | Truncates rows above limit | Local | None |
| Google Sheets | ~100K effective | Very slow at scale | Cloud-based | Google account |
| Online CSV Tools | 50K–500K typical | Upload bottleneck | Data leaves device | None |
| Python pandas | 4–8GB practical (RAM-bound) | 30–60s for 10M rows | Local | Install + coding |
| SplitForge | 200M+ verified (10GB+) | ~40s for 10M / ~14 min for 10GB | 100% client-side | None |
vs. Microsoft Excel
- Excel limit: 1,048,576 rows (hard cap per Microsoft specifications)
- SplitForge: 200,000,000+ rows validated
- Advantage: 190× more capacity
vs. Google Sheets
- Sheets limit: 10M cells total (not rows)
- Performance: Slows to crawl at 100K rows
- Advantage: 100× faster at scale
vs. Online CSV Tools
- Typical limits: 100K–500K rows, 50–100MB files
- Privacy risk: Requires uploading sensitive data
- Advantage: 400× capacity + zero uploads
vs. Python pandas
- Requires: Installation, coding, command line
- Memory: `df.drop_duplicates()` loads the entire file, so expect OOM above ~4GB on 8GB machines
- Speed: 30–60s for 10M rows (similar hardware)
- Advantage: No setup + handles files that trigger pandas MemoryError
Real-World Use Cases
Marketing & CRM
Problem: Exported 5 million email addresses from multiple campaigns, need to dedupe before sending.
Traditional approach: Upload to email validation service ($200–500), wait 2–6 hours, hope they delete your data afterward.
SplitForge solution: Drop into Remove Duplicates, process in ~20 seconds, download cleaned list. GDPR-compliant (no upload to servers).
Savings: Avoid sending duplicates (reduces spam complaints), cut validation costs ($200–500/month), stay compliant (zero data exposure).
Real example: A marketing agency processed a 3.2M-contact list from 7 events and found 847K duplicates. That prevented about $2,500 in wasted email sends ($0.003/email × 847K duplicates ≈ $2,541).
E-commerce & Inventory
Problem: 2 million product SKUs across Amazon, Shopify, WooCommerce—duplicates everywhere.
Traditional approach: Manual Excel sorting (hits 1M row limit), hire VA for $500–1,000 to clean manually.
SplitForge solution: Dedupe in ~8 seconds, identify duplicate listings instantly.
Savings: Hours of manual work eliminated, prevent duplicate product listings, maintain accurate inventory counts.
Finance & Compliance
Problem: 10 million transaction records need duplicate detection for SOX audit.
Traditional approach: Python script by data engineer ($2K–5K consulting fee), or upload to compliance vendor (audit risk).
SplitForge solution: Process locally (SOX compliant—no uploads), generate clean file in ~40 seconds, auditor-ready output.
Savings: Audit-ready data without privacy violations, zero consulting fees, instant processing vs days-long vendor turnaround.
Compliance note: Client-side OPFS processing means transaction data is never transmitted to third parties, which supports SOX, SOC 2, and financial data protection requirements.
Healthcare & Research
Problem: Patient records or clinical trial data with duplicates. HIPAA prohibits uploading PHI to third-party tools.
Traditional approach: Desktop software ($2K–10K license), or manual Excel work (hits row limits).
SplitForge solution: HIPAA-aligned processing (never uploaded), instant deduplication, handles files far beyond Excel's row limit.
Savings: Protect patient privacy while cleaning data, avoid expensive desktop licenses, no 1M row ceiling.
Data Analytics
Problem: Large CSV exports from databases need cleaning before analysis.
Traditional approach: Python pandas (requires coding, OOM risk on large files), or split files to fit Excel limits.
SplitForge solution: OPFS path handles files that trigger pandas MemoryError. Zero setup. Non-technical users can run it. For more techniques on processing 2+ million row datasets, see our complete guide to processing 2 million rows.
Savings: Immediate productivity, no Python installation, no file splitting required.
The Privacy Advantage
Every online CSV tool we tested requires uploading your file to their servers.
That means:
- Your customer emails are on someone else's server
- Your transaction data is transmitted over the internet
- You're trusting a third party with sensitive information
- You're potentially violating GDPR, HIPAA, or SOX compliance
SplitForge never uploads your data.
All processing happens in your browser using Web Workers + OPFS:
- Data stays on your device (OPFS is browser-local sandboxed storage)
- No transmission over networks at any point
- No third-party access
- No compliance violations
- Instant processing (no upload wait time)
For enterprise teams, this isn't just convenient; it's a requirement. GDPR Article 32 mandates appropriate security measures for personal data processing. Client-side processing supports this by removing data transmission from the workflow entirely. For a complete privacy-first CSV workflow, see our data privacy checklist.
Try It Yourself
We're not asking you to trust our benchmark. Reproduce it.
- Generate a test file:

      # Create a 10M-row CSV with Python (or use your own data)
      python -c "
      import csv, random
      with open('test_10m.csv', 'w', newline='') as f:
          w = csv.writer(f)
          w.writerow(['id', 'email', 'name'])
          for i in range(10_000_000):
              # ids are drawn at random from 8.5M values, so roughly 40% of rows are duplicates
              uid = random.randint(1, 8_500_000)
              w.writerow([uid, f'user{uid}@example.com', f'User {uid}'])
      "

- Drag and drop your file
- Click "Remove Duplicates"
- Time it: expect ~40 seconds on modern hardware (i5-equivalent or better, SSD)
The tool is free, requires no signup, and handles files up to your available disk space.
What's Next: More Benchmarks Coming
Remove Duplicates is one of several SplitForge tools with verified large-file benchmarks. CSV Merger handles multi-file merges with no file size ceiling — tested at 42GB total input, with external sort dedup verified at 60 million rows.
All tools use the same engineering principles:
- OPFS for large-scale external sort (where needed)
- Web Workers for non-blocking processing
- Streaming architecture to prevent heap pressure
- Zero uploads, zero server dependency
Follow our Performance Hub to see all current benchmarks.
The Bigger Picture: What Modern Browsers Can Do
This benchmark shows that modern browsers are underestimated as data processing platforms.
With the right architecture, browsers can match or exceed server-side tools:
- OPFS stores sort intermediates without V8 heap pressure
- Streaming APIs prevent full-file-in-memory requirement
- Web Workers enable parallel processing without UI blocking
- Binary operations (Uint8Array bitsets) rival compiled languages
- No installation or dependencies required
- Universal compatibility (works on any device with a modern browser)
The "web apps are slow" narrative is outdated. The "web apps can't handle big data" narrative is dead.
When engineered properly, browsers can process 10GB of data without uploading a single byte.
FAQ
Conclusion: Engineering Over Infrastructure
You don't need cloud servers to process 200 million rows.
You don't need desktop software installations.
You don't need to upload sensitive data to third parties.
You just need smart engineering: OPFS external sort, streaming, k-way merges, and a privacy-first architecture that keeps your data on your device.
That's what we built. That's what this benchmark proves.