Performance

10M Rows Deduplicated in ~40 Seconds: OPFS Browser Deduplication at 10GB+ Scale

December 10, 2025
By Splitforge Team

Excel stops at 1,048,576 rows. Most online CSV tools crash around 500,000 rows. Python can handle millions—but requires installation, dependencies, and coding knowledge.

If your team works with CRM exports, email lists, product catalogs, transaction logs, or analytics dumps, this benchmark tells you whether your deduplication workflow belongs in Excel, Python… or the browser.

At SplitForge, we've already proven browsers can handle 15 million rows for CSV merging. For Remove Duplicates, we went further: 10GB+ files at 218–277K rows/second, with a worker heap that stays near-constant at ~100MB regardless of input size.

The architecture change that made it possible—OPFS external sort—is what this post covers.

Updated June 2026: Numbers in this post reflect the OPFS 3-pass algorithm shipped in the current Remove Duplicates worker. Previous versions of this post cited a 435K rows/sec inline hash benchmark (February 2026). That number was accurate for files below 100K rows, but it did not reflect the memory ceiling that real-world files hit. The OPFS path trades some single-file throughput for scale limited only by available disk space. The Performance page has full detail.

TL;DR

  • 10M rows deduped in ~40 seconds (218–277K rows/sec, OPFS path)
  • 10GB+ validated — 200M rows in ~14 minutes with ~100MB worker heap
  • Fully client-side (zero uploads)
  • Outperforms Excel (hard 1M row cap) and matches Python pandas at scale
  • Verified benchmark, reproducible yourself

Try it: splitforge.app/tools/remove-duplicates
Full benchmark: splitforge.app/remove-duplicates-performance



The Benchmark: OPFS at 10GB Scale

We deduplicated CSV files at 218–277K rows/second in a browser—with a worker heap that stays near-constant at ~100MB regardless of input size. This benchmark proves that client-side processing can handle enterprise-scale data without memory trade-offs, and without any uploads.

We tested SplitForge's Remove Duplicates tool against files ranging from 100K rows to 200M rows (10GB+).

The results:

  • 10,000,000 rows processed with full duplicate detection in ~40 seconds
  • 218–277K rows/second sustained OPFS throughput (browser Gate 3, Chrome stable, June 2026)
  • 10GB+ validated — 200M rows in ~14 minutes, ~100MB worker heap throughout
  • Zero uploads — all processing happens client-side
  • 100% privacy — your data never leaves your device

This isn't a marketing claim. It's a verified benchmark you can reproduce yourself.

Full Benchmark Data

| Rows | Approx. File Size | Time | Throughput | Path | Status |
|---|---|---|---|---|---|
| 1,000 | <1MB | <0.1s | | Inline hash | ✓ Pass |
| 100,000 | ~10MB | ~0.3s | ~333K/s | Inline hash | ✓ Pass |
| 1,000,000 | ~100MB | ~4s | ~250K/s | OPFS 3-pass | ✓ Pass |
| 5,000,000 | ~500MB | ~20s | ~250K/s | OPFS 3-pass | ✓ Pass |
| 10,000,000 | ~1GB | ~40s | ~248K/s | OPFS 3-pass | ✓ Pass |
| 28,000,000 | 1.9GB | 1.6 min | ~288K/s | OPFS (Node harness) | ✓ Pass |
| 200,000,000+ | ~10GB | ~14 min | 218–277K/s | OPFS (browser Gate 3) | ✓ Pass |

Test environment (browser): Chrome (stable), Windows 11, Intel i5-12600KF (3.70GHz), 64GB RAM, June 2026 Gate 3 validation. Node.js harness: same algorithm, Node v24.14, same hardware — 288K rows/sec at 28M rows, June 2026.

Below 100K rows the inline hash path runs at higher throughput (~333K+/s) — the OPFS overhead is not justified at that scale and the tool switches automatically.

View the full benchmark page with architecture details and memory profiles.


Why This Benchmark Matters

Excel Can't Load the Whole File

Microsoft Excel has a hard limit: 1,048,576 rows maximum per worksheet. According to Microsoft's Excel specifications, this is a worksheet limitation that cannot be exceeded. If your CSV has 1,048,577 rows, Excel loads only the first 1,048,576 and warns that the file was not loaded completely.

Our 10M row test file had nearly 10× Excel's capacity. Our 10GB validation run had over 190× Excel's capacity.

For data analysts, marketers, and finance teams working with large datasets, this limit isn't theoretical. It's a daily roadblock. CRM exports, email lists, transaction logs, and analytics dumps routinely exceed Excel's row limit. For a comprehensive guide to Excel's row limit, see our Excel row limit explained guide.

Online Tools Fail at Scale

We tested competitors:

  • Most online CSV tools crash between 100K–500K rows
  • Those that handle larger files require uploads (privacy risk)
  • Server-side processing means waiting for queues
  • Many impose file size limits (50MB, 100MB caps)

SplitForge processes 10GB+ locally, with no upload, no server queue, and a bounded worker heap.

Python Works—But Requires Setup

Yes, Python's pandas library can deduplicate 10 million rows. But:

  • You need Python installed
  • You need to install pandas (pip install pandas)
  • You need to write code (not everyone can)
  • pd.read_csv() loads the entire file into RAM before df.drop_duplicates() can run, so memory issues appear above ~4GB
  • Processing takes 30–60 seconds on equivalent hardware for 10M rows

SplitForge requires:

  • No installation
  • No coding
  • No command line
  • Just drag, drop, click

And it handles files that would trigger a MemoryError in pandas on most machines.


The OPFS 3-Pass Architecture

Building a browser tool that processes 10GB without running out of memory required replacing the inline hash approach with an external sort algorithm backed by the Origin Private File System (OPFS).

The key insight: a browser's JavaScript heap is limited. A 10M-row Set of hashes stays manageable, but at 100M+ rows it grows to gigabytes. The OPFS path removes the heap ceiling entirely.

The algorithm runs in three passes:

Input CSV
  │
  ▼ Pass 1 — Sort Chunks
  Stream input → parse rows → build sort keys
  Flush sorted 50MB chunks to OPFS storage
  (Each chunk: sort(keys), write to file, clear memory)
  │
  ▼ Pass 2 — K-Way Merge
  Open all chunk files simultaneously via MinHeap
  Batch-reduce if chunks > 32 (multiple merge levels)
  Emit winner row index per duplicate group
  Build compact bitset of survivors
  │
  ▼ Pass 3 — Survivor Scan
  Re-stream input
  Check bitset for each row
  Emit survivors → output CSV

Why this approach scales to 10GB:

  • Pass 1: Sorts data in bounded ~50MB memory windows. Total chunks ≈ input size / 50MB. Heap usage stays near-constant.
  • Pass 2: K-way merge is bounded by I/O, not memory: 32 concurrent file readers + one MinHeap. The winner bitset is tiny: ceil(rows/8) bytes (25MB for 200M rows).
  • Pass 3: Pure streaming scan — no row accumulation.
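The three passes can be sketched end to end on a toy input. This is a minimal illustration under several assumptions, not the shipped worker: plain arrays stand in for OPFS chunk files, the whole row string is used as the sort key, the first occurrence of each key is treated as the winner, and a linear scan over chunk heads replaces the MinHeap. `dedupThreePass` and its parameters are hypothetical names.

```javascript
// Illustrative three-pass exact dedup. Arrays stand in for OPFS chunk files;
// every name here is hypothetical, not one of the tool's real identifiers.
function dedupThreePass(rows, chunkSize) {
  // Pass 1: cut the input into bounded chunks and sort each by (key, row index).
  const chunks = [];
  for (let start = 0; start < rows.length; start += chunkSize) {
    const chunk = [];
    const end = Math.min(start + chunkSize, rows.length);
    for (let i = start; i < end; i++) chunk.push({ key: rows[i], index: i });
    chunk.sort((a, b) => (a.key < b.key ? -1 : a.key > b.key ? 1 : a.index - b.index));
    chunks.push(chunk);
  }

  // Pass 2: k-way merge. Equal keys come out adjacent, so the first entry of
  // each run is the winner and gets its bit set in the survivor bitset.
  const survivors = new Uint8Array(Math.ceil(rows.length / 8));
  const cursors = chunks.map(() => 0);
  let lastKey = null;
  for (;;) {
    let best = -1;
    for (let c = 0; c < chunks.length; c++) {
      if (cursors[c] >= chunks[c].length) continue; // chunk exhausted
      if (best === -1) { best = c; continue; }
      const head = chunks[c][cursors[c]];
      const cur = chunks[best][cursors[best]];
      if (head.key < cur.key || (head.key === cur.key && head.index < cur.index)) best = c;
    }
    if (best === -1) break; // every chunk exhausted
    const entry = chunks[best][cursors[best]++];
    if (entry.key !== lastKey) {
      survivors[entry.index >> 3] |= 1 << (entry.index & 7);
      lastKey = entry.key;
    }
  }

  // Pass 3: re-scan the input in original order and keep only flagged rows.
  const output = [];
  for (let i = 0; i < rows.length; i++) {
    if (survivors[i >> 3] & (1 << (i & 7))) output.push(rows[i]);
  }
  return output;
}

const deduped = dedupThreePass(['b', 'a', 'b', 'c', 'a', 'd'], 2);
```

Because Pass 3 walks the input in its original order, the output preserves row order even though the intermediate chunks are sorted.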

Worker heap profile (10GB run, Chrome DevTools):

  • Pass 1: ~100MB peak (one in-flight sort chunk)
  • Pass 2: ~34MB (28MB bitset + OPFS I/O buffers)
  • Pass 3: ~20MB (streaming only)

No OOM. No browser crash. No file size ceiling beyond available disk.

What Changed from the Old Architecture

The previous version (February 2026) used a simpler inline hash approach:

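// Old inline hash path (still used below 100K rows)
const seenHashes = new Set(); // hashes of every row emitted so far
for await (const line of streamCSVLines(file)) {
  const hash = computeFNV1a(line);
  // Emit a row only the first time its hash appears
  if (!seenHashes.has(hash)) {
    seenHashes.add(hash);
    writeToOutput(line);
  }
}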

This is still the fastest path and runs for files under 100K rows. It's O(n) and does a single pass. But the seenHashes Set grows with every unique row — at 10M rows it occupies ~40–80MB. At 100M+ rows it exceeds safe V8 heap limits.
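For readers unfamiliar with FNV-1a, here is a minimal sketch of what a function like computeFNV1a could look like. The 32-bit width and UTF-8 encoding are assumptions; the post does not say which variant the tool uses. A 32-bit hash can collide on large inputs, so a production path would likely need a wider hash or a key comparison to avoid treating distinct rows as duplicates.

```javascript
// 32-bit FNV-1a over the UTF-8 bytes of a line. The name mirrors computeFNV1a
// from the snippet above, but this body is an assumption, not the tool's source.
function computeFNV1a(line) {
  const bytes = new TextEncoder().encode(line);
  let hash = 0x811c9dc5; // FNV offset basis
  for (const byte of bytes) {
    hash ^= byte;
    hash = Math.imul(hash, 0x01000193); // multiply by the FNV prime, mod 2^32
  }
  return hash >>> 0; // unsigned 32-bit result
}

// Usage: two identical lines share a hash, so the second one is skipped.
const seen = new Set();
const kept = [];
for (const line of ['a,1', 'b,2', 'a,1']) {
  const hash = computeFNV1a(line);
  if (!seen.has(hash)) {
    seen.add(hash);
    kept.push(line);
  }
}
```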

The OPFS path is ~25–30% slower for 10M rows (248K vs ~350K rows/sec) but removes the memory ceiling entirely. For 10GB files, the inline path would OOM; the OPFS path completes in ~14 minutes.

The tool selects automatically:

  • Under 100K rows → inline hash (fastest, minimal overhead)
  • Over 100K rows → OPFS 3-pass (bounded memory, unlimited scale)

Technical Architecture Deep Dive

Memory Management Strategy

OPFS as unbounded swap: Sort chunks are written to the Origin Private File System, a sandboxed, origin-scoped file store whose synchronous access handles are available in dedicated Web Workers. OPFS files live on disk rather than in the JavaScript heap, so they are not subject to V8 heap limits (they do count against the browser's storage quota). Writing 50MB chunks to OPFS behaves like writing to local disk: no GC pressure, no V8 heap growth.
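A chunk flush might look roughly like the sketch below. It uses the standard OPFS calls (getFileHandle, createSyncAccessHandle, truncate, write, flush, close), but the function name, file naming, and error handling are assumptions rather than the tool's actual code. In a real worker, `dir` would come from `await navigator.storage.getDirectory()`.

```javascript
// Sketch of flushing one sorted chunk to OPFS from inside a dedicated Web Worker.
// `dir` is a FileSystemDirectoryHandle; function and file names are hypothetical.
async function writeChunkToOPFS(dir, name, bytes) {
  const fileHandle = await dir.getFileHandle(name, { create: true });
  const access = await fileHandle.createSyncAccessHandle(); // worker-only API
  try {
    access.truncate(0);                              // discard any stale content
    const written = access.write(bytes, { at: 0 });  // returns bytes written
    access.flush();                                  // push buffered writes to storage
    return written;
  } finally {
    access.close();                                  // release the exclusive lock on the file
  }
}
```

Once the write returns, the worker can drop its reference to `bytes`, which is what keeps the heap near the size of a single in-flight chunk.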

Bitset compression: The survivor index uses 1 bit per input row. For 200M rows: ceil(200M/8) bytes = 25MB. That is far smaller than storing row indices, hashes, or copies of row data.
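A survivor bitset of this kind can be sketched in a few lines. The class and method names below are illustrative assumptions; only the one-bit-per-row layout comes from the description above.

```javascript
// One bit per input row: bit i is 1 when row i survives deduplication.
// Class and method names are illustrative, not the tool's own.
class SurvivorBitset {
  constructor(rowCount) {
    this.bits = new Uint8Array(Math.ceil(rowCount / 8));
  }
  mark(row) {
    this.bits[row >>> 3] |= 1 << (row & 7); // byte = row / 8, bit = row % 8
  }
  has(row) {
    return (this.bits[row >>> 3] & (1 << (row & 7))) !== 0;
  }
}

// Size check for the 200M-row case, computed without allocating the array.
const bytesFor200M = Math.ceil(200_000_000 / 8); // 25,000,000 bytes

const bitset = new SurvivorBitset(20);
bitset.mark(0);
bitset.mark(9);
bitset.mark(19);
```

For comparison, storing survivor row indices as 32-bit integers would cost 4 bytes per surviving row, which is 32 times the per-row cost of the bitset.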

Chunked output writing: Survivors are emitted in streaming passes with buffered writes. Output never accumulates in heap memory.

K-Way Heap Merge (Pass 2)

The MinHeap merge handles up to 32 concurrent chunk streams simultaneously. For inputs generating more than 32 chunks (any file above ~1.6GB), a batch-reduce pass runs first: batches of 32 chunks are merged into intermediate files, then the intermediate files are merged in the final pass.
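The core of a heap-based k-way merge can be sketched as follows. This is a generic textbook version under stated assumptions: in-memory sorted arrays stand in for chunk files, and `MinHeap` and `kWayMerge` are hypothetical names, not the tool's classes.

```javascript
// Minimal binary min-heap ordered by a comparator (illustrative only).
class MinHeap {
  constructor(compare) { this.items = []; this.compare = compare; }
  get size() { return this.items.length; }
  push(item) {
    const a = this.items;
    a.push(item);
    let i = a.length - 1;
    while (i > 0) { // sift up
      const parent = (i - 1) >> 1;
      if (this.compare(a[i], a[parent]) >= 0) break;
      [a[i], a[parent]] = [a[parent], a[i]];
      i = parent;
    }
  }
  pop() {
    const a = this.items;
    const top = a[0];
    const last = a.pop();
    if (a.length > 0) {
      a[0] = last;
      let i = 0;
      for (;;) { // sift down
        const left = 2 * i + 1;
        const right = left + 1;
        let smallest = i;
        if (left < a.length && this.compare(a[left], a[smallest]) < 0) smallest = left;
        if (right < a.length && this.compare(a[right], a[smallest]) < 0) smallest = right;
        if (smallest === i) break;
        [a[i], a[smallest]] = [a[smallest], a[i]];
        i = smallest;
      }
    }
    return top;
  }
}

// Merge already-sorted arrays of strings; each array stands in for one chunk file.
function kWayMerge(sortedChunks) {
  const heap = new MinHeap((x, y) => (x.key < y.key ? -1 : x.key > y.key ? 1 : x.chunk - y.chunk));
  sortedChunks.forEach((chunk, c) => {
    if (chunk.length > 0) heap.push({ key: chunk[0], chunk: c, pos: 0 });
  });
  const merged = [];
  while (heap.size > 0) {
    const { key, chunk, pos } = heap.pop();
    merged.push(key);
    const next = pos + 1; // refill the heap from the chunk that just yielded a row
    if (next < sortedChunks[chunk].length) {
      heap.push({ key: sortedChunks[chunk][next], chunk, pos: next });
    }
  }
  return merged;
}
```

The heap never holds more than one entry per chunk, so its size is bounded by the fan-in (32 in the description above) regardless of how many rows flow through it.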

For the 10GB benchmark: ~240 chunks → 8 batches of 32 → 8 intermediates → final 8-way merge. Each batch-reduce step reduces total chunk count by 31. Total I/O: approximately 3× the input file size across all three passes.
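The batch-reduce arithmetic can be checked with a small helper. `planMerge` is a hypothetical function written for this illustration; it only counts files per merge level and assumes every batch is filled up to the fan-in before the next one starts.

```javascript
// Returns the number of files entering each merge level for a given fan-in.
// planMerge is a hypothetical helper, not part of the tool.
function planMerge(chunkCount, fanIn) {
  const levels = [chunkCount];
  let remaining = chunkCount;
  while (remaining > fanIn) {
    // Each batch of up to fanIn files is merged into one intermediate file.
    remaining = Math.ceil(remaining / fanIn);
    levels.push(remaining);
  }
  return levels;
}

const tenGbPlan = planMerge(240, 32); // [240, 8]: 8 intermediates, then a final 8-way merge
const smallPlan = planMerge(20, 32);  // [20]: a single merge, no batch-reduce needed
```

With a fan-in of 32, a second batch-reduce level would only be needed above 32 × 32 = 1,024 chunks, roughly 51GB of input at 50MB per chunk.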

Web Workers for Non-Blocking Processing

All three passes run inside a Web Worker:

// Main thread
const worker = new Worker('deduperWorker.js');
worker.postMessage({ file, options });
worker.onmessage = ({ data }) => updateProgress(data);

// Worker thread
self.onmessage = async ({ data }) => {
  await dedupOPFS(data.file, data.options);
  self.postMessage({ type: 'complete' });
};

The browser stays responsive throughout. Users can switch tabs, scroll, and interact with the page even during a 14-minute 10GB run.

Performance Profiling Results

Benchmark breakdown (10M rows, ~1GB file, OPFS 3-pass path):

  • Pass 1 — sort chunks: ~18s (45% of total)
  • Pass 2 — k-way merge: ~8s (20% of total)
  • Pass 3 — survivor scan: ~14s (35% of total)
  • Total: ~40s

Bottleneck: I/O. Pass 1 and Pass 3 each read the full input, and Pass 2 reads back the sort intermediates written in Pass 1. Passes 1 and 3 are I/O-bound on typical laptop SSDs. Pass 2 is CPU-bound (MinHeap comparisons) but runs on already-small chunk files.


How This Compares to Alternatives

Here's how SplitForge stacks up against every alternative:

| Tool | Max Rows | Performance | Privacy | Setup Required |
|---|---|---|---|---|
| Excel | 1,048,576 (hard limit) | Crashes above limit | Local | None |
| Google Sheets | ~100K effective | Very slow at scale | Cloud-based | Google account |
| Online CSV Tools | 50K–500K typical | Upload bottleneck | Data leaves device | None |
| Python pandas | 4–8GB practical (RAM-bound) | 30–60s for 10M rows | Local | Install + coding |
| SplitForge | 200M+ verified (10GB+) | ~40s for 10M / ~14 min for 10GB | 100% client-side | None |

vs. Microsoft Excel

  • Excel limit: 1,048,576 rows (hard cap per Microsoft specifications)
  • SplitForge: 200,000,000+ rows validated
  • Advantage: 190× more capacity

vs. Google Sheets

  • Sheets limit: 10M cells total (not rows)
  • Performance: Slows to crawl at 100K rows
  • Advantage: 100× faster at scale

vs. Online CSV Tools

  • Typical limits: 100K–500K rows, 50–100MB files
  • Privacy risk: Requires uploading sensitive data
  • Advantage: 400× capacity + zero uploads

vs. Python pandas

  • Requires: Installation, coding, command line
  • Memory: pd.read_csv() loads the entire file before df.drop_duplicates() runs, so expect OOM above ~4GB on 8GB machines
  • Speed: 30–60s for 10M rows (similar hardware)
  • Advantage: No setup + handles files that trigger pandas MemoryError

Real-World Use Cases

Marketing & CRM

Problem: Exported 5 million email addresses from multiple campaigns, need to dedupe before sending.

Traditional approach: Upload to email validation service ($200–500), wait 2–6 hours, hope they delete your data afterward.

SplitForge solution: Drop into Remove Duplicates, process in ~20 seconds, download cleaned list. GDPR-compliant (no upload to servers).

Savings: Avoid sending duplicates (reduces spam complaints), cut validation costs ($200–500/month), stay compliant (zero data exposure).

Real example: A marketing agency processed a 3.2M-contact list from 7 events and found 847K duplicates, preventing about $2,500 in wasted email sends ($0.003/email × 847K duplicates ≈ $2,541).

E-commerce & Inventory

Problem: 2 million product SKUs across Amazon, Shopify, WooCommerce—duplicates everywhere.

Traditional approach: Manual Excel sorting (hits 1M row limit), hire VA for $500–1,000 to clean manually.

SplitForge solution: Dedupe in ~8 seconds, identify duplicate listings instantly.

Savings: Hours of manual work eliminated, prevent duplicate product listings, maintain accurate inventory counts.

Finance & Compliance

Problem: 10 million transaction records need duplicate detection for SOX audit.

Traditional approach: Python script by data engineer ($2K–5K consulting fee), or upload to compliance vendor (audit risk).

SplitForge solution: Process locally (SOX compliant—no uploads), generate clean file in ~40 seconds, auditor-ready output.

Savings: Audit-ready data without privacy violations, zero consulting fees, instant processing vs days-long vendor turnaround.

Compliance note: Client-side OPFS processing means transaction data is never transmitted to third parties, which supports SOX, SOC 2, and financial data protection requirements.

Healthcare & Research

Problem: Patient records or clinical trial data with duplicates. HIPAA prohibits uploading PHI to third-party tools.

Traditional approach: Desktop software ($2K–10K license), or manual Excel work (hits row limits).

SplitForge solution: HIPAA-aligned processing (never uploaded), instant deduplication, handles files far beyond Excel's row limit.

Savings: Protect patient privacy while cleaning data, avoid expensive desktop licenses, no 1M row ceiling.

Data Analytics

Problem: Large CSV exports from databases need cleaning before analysis.

Traditional approach: Python pandas (requires coding, OOM risk on large files), or split files to fit Excel limits.

SplitForge solution: OPFS path handles files that trigger pandas MemoryError. Zero setup. Non-technical users can run it. For more techniques on processing 2+ million row datasets, see our complete guide to processing 2 million rows.

Savings: Immediate productivity, no Python installation, no file splitting required.


The Privacy Advantage

Every online CSV tool we tested requires uploading your file to their servers.

That means:

  • Your customer emails are on someone else's server
  • Your transaction data is transmitted over the internet
  • You're trusting a third party with sensitive information
  • You're potentially violating GDPR, HIPAA, or SOX compliance

SplitForge never uploads your data.

All processing happens in your browser using Web Workers + OPFS:

  • Data stays on your device (OPFS is browser-local sandboxed storage)
  • No transmission over networks at any point
  • No third-party access
  • No compliance violations
  • Instant processing (no upload wait time)

For enterprise teams, this isn't just convenient—it's a requirement. GDPR Article 32 mandates appropriate security measures for personal data processing. Client-side processing satisfies this by eliminating data transmission entirely. For a complete privacy-first CSV workflow, see our data privacy checklist.


Try It Yourself

We're not asking you to trust our benchmark. Reproduce it.

  1. Generate a test file:

    # Create 10M row CSV with Python (or use your own data)
    python -c "
    import csv, random
    with open('test_10m.csv', 'w', newline='') as f:
        w = csv.writer(f)
        w.writerow(['id', 'email', 'name'])
        for i in range(10_000_000):
            # IDs are drawn with replacement from 8.5M values,
            # so roughly 40% of rows repeat an earlier row
            uid = random.randint(1, 8_500_000)
            w.writerow([uid, f'user{uid}@example.com', f'User {uid}'])
    "
    
  2. Visit splitforge.app/tools/remove-duplicates

  3. Drag and drop your file

  4. Click "Remove Duplicates"

  5. Time it — expect ~40 seconds on modern hardware (i5-equivalent or better, SSD)

The tool is free, requires no signup, and handles files up to your available disk space.


What's Next: More Benchmarks Coming

Remove Duplicates is one of several SplitForge tools with verified large-file benchmarks. CSV Merger handles multi-file merges with no file size ceiling — tested at 42GB total input, with external sort dedup verified at 60 million rows.

All tools use the same engineering principles:

  • OPFS for large-scale external sort (where needed)
  • Web Workers for non-blocking processing
  • Streaming architecture to prevent heap pressure
  • Zero uploads, zero server dependency

Follow our Performance Hub to see all current benchmarks.


The Bigger Picture: What Modern Browsers Can Do

This benchmark shows that modern browsers are underestimated as data processing platforms: built well, they can match or exceed server-side tools.

With the right architecture:

  • OPFS stores sort intermediates without V8 heap pressure
  • Streaming APIs prevent full-file-in-memory requirement
  • Web Workers enable parallel processing without UI blocking
  • Binary operations (Uint8Array bitsets) rival compiled languages
  • No installation or dependencies required
  • Universal compatibility (works on any device with a modern browser)

The "web apps are slow" narrative is outdated. The "web apps can't handle big data" narrative is dead.

When engineered properly, browsers can process 10GB of data without uploading a single byte.


FAQ

Yes. The OPFS 3-pass algorithm keeps worker heap near-constant at ~100MB regardless of file size. 10M rows completes in ~40 seconds. 10GB+ (200M rows) validated in browser Gate 3 testing at 218–277K rows/sec.

For 10M rows, roughly comparable (~40s vs 30–60s). For files above ~4GB, SplitForge has an advantage: the OPFS path handles files that would trigger a MemoryError in pandas on 8–16GB RAM machines.

Yes. All processing happens client-side using Web Workers and OPFS (browser-local sandboxed storage). Files never upload. No data transmission occurs at any point. GDPR, HIPAA, and SOX compliant by architecture.

No fixed memory ceiling on the exact match path. Practical limit is available disk space (OPFS uses ~1.3× input size for sort intermediates). Validated at 10GB+ in browser. For fuzzy matching mode, split very large files first.

Exact match uses sort-key comparison in the OPFS path — collision-free by construction. Fuzzy matching uses Jaro-Winkler with configurable thresholds and 12 presets.

Yes. We've validated 200M+ rows (190× Excel's limit) in browser Gate 3 testing. No artificial row limits.

Processing stops and OPFS temporary files are cleaned up. Re-select the file to restart (nothing is uploaded). Keep the tab active for large files.

Three full I/O passes over the data: Pass 1 (sort chunks), Pass 2 (merge + bitset), Pass 3 (survivor scan). The figure reflects total rows / total wall time across all three passes — each byte of input is read twice, plus the sort intermediates written and read once. Optimizing further would require reducing to two passes, which is a future engineering goal.

Hitting Excel's row limit or file size issues? See our complete guide: Excel Row Limit & Large File Solutions (2026)



Conclusion: Engineering Over Infrastructure

You don't need cloud servers to process 10 million rows.

You don't need desktop software installations.

You don't need to upload sensitive data to third parties.

You just need smart engineering: OPFS external sort, streaming, k-way merges, and a privacy-first architecture that keeps your data on your device.

That's what we built. That's what this benchmark proves.

Process 10GB+ Without Uploads—Start Now

Deduplicate 10 million rows in ~40 seconds
10GB+ scale — 200M+ rows validated with ~100MB worker heap
190× Excel's row limit with zero setup required
Zero uploads — your data never leaves your browser
Verified benchmark: 218–277K rows/second (OPFS, browser Gate 3, June 2026)
