Yesterday, we did something that shouldn't be possible in a browser.
We processed 15,000,000 CSV rows with full deduplication in 67.1 seconds—entirely client-side, with zero uploads, while keeping the browser responsive.
That's 223,200 rows per second. While hashing every row. While comparing millions of values. While streaming output to downloadable files.
TL;DR: We built a browser-based CSV merge tool that processes 15 million rows in 67 seconds using a streaming architecture, Web Workers for background processing, and custom parsing optimizations. The tool handles files larger than available RAM (tested up to 10GB), scales linearly from 1M to 15M rows, and keeps the browser responsive throughout. It handles 14× Excel's row limit, is competitive with pandas (which requires a Python installation), and eliminates the privacy risks of cloud uploads. All processing happens locally using the ReadableStream API and batched output generation—proving browsers can handle enterprise-scale data processing with the right architecture.
The browser is no longer a toy runtime. With the right architecture, it can outperform desktop utilities.
Table of Contents
- The Moment We Stopped Expecting It to Crash
- Why This Matters: Privacy vs. Performance
- The Scaling Curve (Real Benchmarks)
- The Challenge: Privacy vs. Performance
- The Architecture: Three-Layer Streaming
- The Optimizations: Why We're 3-5× Faster
- The Benchmarks: Real-World Performance
- How We Compare: Desktop Tools and Cloud Solutions
- The Privacy Advantage: Zero Trust Architecture
- What This Means for Different Users
- The Technical Takeaway: Browser Architecture Patterns
The Moment We Stopped Expecting It to Crash
When we first attempted 5 million rows, we expected the browser to hang or crash. Instead, it finished smoothly in 15 seconds with constant memory usage and a responsive UI. So we pushed harder—10 million rows. Still stable, still linear scaling.
Then 15 million rows with deduplication enabled—the ultimate stress test. We were certain it would buckle under the memory pressure of hashing and comparing millions of entries while maintaining a Set of seen values.
It didn't.
67.1 seconds later, the merged file downloaded. No crashes. No memory spikes. No UI freezes. The browser stayed responsive throughout, processing 223,200 rows per second while users could switch tabs or interact with other page elements.
That's when we knew: browser-based tools aren't just "fast for the web"—they're competitive with compiled desktop applications when architected correctly.
Why This Matters: Privacy vs. Performance
Modern datasets are too big for Excel (hard limit: 1,048,576 rows) and too sensitive for cloud uploads. SplitForge solves both problems at once with client-side processing.
For context on what 15 million rows represents:
- 14× Excel's hard limit - Excel crashes or truncates at 1M rows
- 30× most browser tools - Typical browser CSV tools choke at 500K-2M rows
- 3-5× desktop utilities - Many desktop CSV processors crash at 2-5M rows
- Faster than upload time - Most cloud solutions take longer than 67 seconds just to upload a 15M row file
The critical insight: Client-side processing eliminates the upload bottleneck. There's no network transfer overhead—just local computation.
The Scaling Curve (Real Benchmarks)
Testing environment: Chrome 120, Intel Core i7-12700K, 16GB RAM, Windows 11
| Rows (Millions) | Time (Seconds) | Throughput (rows/sec) | Memory Peak |
|---|---|---|---|
| 1 | 3.0 | 333,000 | < 600 MB |
| 5 | 15.0 | 333,000 | < 1.2 GB |
| 6.5 (w/ dedupe) | 28.6 | 227,100 | < 1.5 GB |
| 15 (w/ dedupe) | 67.1 | 223,200 | < 2 GB |
Key findings:
- Linear scaling from 1M to 15M rows - Streaming architecture prevents exponential memory growth
- Deduplication adds ~30% overhead - Hash computation and Set storage cost is predictable
- Memory stays proportional - No leaks, no exponential growth patterns
- Browser remains responsive - Web Workers keep main thread free for UI updates
The Challenge: Privacy vs. Performance
When we built SplitForge, we had one non-negotiable requirement: all data processing must happen client-side. No uploads. No servers. No exceptions.
This wasn't a nice-to-have feature. For finance teams processing transaction records, healthcare organizations handling patient data, and legal departments managing confidential documents, uploading CSV files to third-party servers creates unacceptable compliance risks. GDPR Article 28 requires data processing agreements with any third-party processor—something free online tools don't provide.
But client-side processing comes with architectural constraints:
- Limited memory - Browsers cap heap size at 2-4GB (varies by system RAM)
- Single-threaded JavaScript - Synchronous processing blocks UI, freezing the browser
- Browser APIs only - No system calls, no optimized C libraries, no GPU acceleration
- Garbage collection pauses - Large object creation triggers GC, causing stutters
Most tools solve this by requiring uploads to backend servers with more resources. We decided to solve it with streaming architecture instead.
The Architecture: Three-Layer Streaming
Our CSV Merge tool uses a three-layer streaming architecture that processes files larger than available RAM while keeping the browser responsive.
1. Streaming File Reader
Instead of loading entire files into memory (file.text()), we stream them in chunks using the ReadableStream API. This lets us process 10GB files on machines with 4GB of RAM by keeping only small portions in memory at once.
According to MDN ReadableStream documentation, the streaming approach provides:
- Backpressure handling - Consumer controls read speed
- Chunked processing - Process data incrementally
- Memory efficiency - Only current chunk in memory
```javascript
async function* streamLinesFast(file) {
  const reader = file.stream().getReader();
  const decoder = new TextDecoder();
  let buffer = '';
  while (true) {
    const { value, done } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });
    // Scan the buffer from the start: yield complete lines, keep the partial
    // line for the next chunk. The retained buffer always starts at a line
    // boundary, so the quote state is recomputed correctly on each pass.
    let inQuotes = false;
    let lineStart = 0;
    let i = 0;
    while (i < buffer.length) {
      const c = buffer[i];
      if (c === '"') inQuotes = !inQuotes;
      if (!inQuotes && (c === '\n' || c === '\r')) {
        const line = buffer.slice(lineStart, i);
        lineStart = i + 1;
        if (line.trim()) yield line;
      }
      i++;
    }
    buffer = buffer.slice(lineStart);
  }
  buffer += decoder.decode(); // flush any trailing multi-byte sequence
  if (buffer.trim()) yield buffer;
}
```
This generator yields complete CSV lines without loading the full file, handling quoted fields with embedded newlines correctly.
2. Web Worker for Background Processing
All heavy computation runs in a Web Worker per the Web Workers API specification, keeping the main thread responsive. The worker handles CPU-intensive operations while the UI thread remains free for user interactions.
The worker processes:
- CSV parsing with proper quote/escape handling per RFC 4180
- Deduplication using FNV-1a hash (fast, collision-resistant)
- Output formatting with minimal escaping overhead
- Memory management with explicit garbage collection hints
Progress updates stream back to the UI via postMessage() without blocking:
```javascript
// Worker posts progress every 100K rows
if (rowsProcessed % 100000 === 0) {
  self.postMessage({
    type: 'progress',
    rowsProcessed,
    totalRows: estimatedTotal
  });
}
```
The main thread receives updates asynchronously, updating the progress bar without interrupting processing.
3. Batched Output Generation
Instead of building one giant string in memory, we create batches of ~500K rows and encode them directly to Uint8Array. This avoids massive JavaScript string allocations and speeds up Blob creation by 3-5×.
```javascript
const batch = []; // Array of CSV lines
const BATCH_SIZE = 500000;
const chunks = [];

function flushBatch() {
  const text = batch.join('\n') + '\n';
  const encoded = new TextEncoder().encode(text);
  chunks.push(encoded);
  batch.length = 0; // Clear without reallocating
}

// Process rows
for (const row of parsedRows) {
  batch.push(row);
  if (batch.length >= BATCH_SIZE) {
    flushBatch();
  }
}
flushBatch(); // Encode the trailing partial batch

// Final Blob construction with pre-encoded chunks
const blob = new Blob(chunks, { type: 'text/csv' });
```
This is dramatically faster than new Blob([giantString]) because:
- Avoids creating multi-GB strings in memory
- TextEncoder converts directly to UTF-8 bytes
- Blob constructor concatenates binary chunks efficiently
- Garbage collector can free batches immediately after encoding
The Optimizations: Why We're 3-5× Faster
We implemented several critical optimizations that compound for massive speedups on large files.
Fast CSV Parser (3× Standard Parsers)
Standard CSV parsers prioritize correctness and edge cases over raw speed. We wrote a specialized parser that's 3× faster for our use case by eliminating unnecessary validation:
```javascript
function parseCSVLineFast(line, delimiter) {
  const values = [];
  let current = '';
  let inQuotes = false;
  let i = 0;
  while (i < line.length) {
    const c = line[i];
    if (inQuotes) {
      if (c === '"' && line[i + 1] === '"') {
        current += '"'; // escaped quote inside a quoted field
        i += 2;
      } else if (c === '"') {
        inQuotes = false; // closing quote
        i++;
      } else {
        current += c;
        i++;
      }
    } else {
      if (c === '"') {
        inQuotes = true; // opening quote
        i++;
      } else if (c === delimiter) {
        values.push(current);
        current = '';
        i++;
      } else {
        current += c;
        i++;
      }
    }
  }
  values.push(current);
  return values;
}
```
This parser handles RFC 4180 quoted fields correctly while avoiding regex overhead and unnecessary string copies.
FNV-1a Hash for Deduplication
For deduplication, we hash millions of rows for O(1) duplicate detection. FNV-1a (Fowler-Noll-Vo) is a good fit here: fast, simple to implement, and collision-resistant enough for hash-table use.
```javascript
function hashRow(values) {
  let hash = 2166136261; // FNV offset basis
  for (let i = 0; i < values.length; i++) {
    const str = values[i] == null ? '' : String(values[i]);
    for (let j = 0; j < str.length; j++) {
      hash ^= str.charCodeAt(j);
      hash = Math.imul(hash, 16777619); // FNV prime
    }
    hash ^= 31; // Field separator
  }
  return hash >>> 0; // Convert to unsigned 32-bit
}
```
Stored in a JavaScript Set, this gives us O(1) average-case duplicate detection with far less memory than storing full row strings.
Fast Mode (Optional 3-5× Speedup)
For users who don't need strict CSV compliance, we offer Fast Mode which uses split() instead of full parsing. This yields 3-5× speedup on clean data without quoted fields or embedded delimiters.
```javascript
// Fast mode for clean CSV data
const values = line.split(delimiter);
```
This trades RFC 4180 compliance for raw speed—appropriate when data quality is known and trusted.
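The trade-off is easy to see on a quoted field (illustrative values only):

```javascript
// split() is correct on clean rows, wrong once a field contains the delimiter
const clean = 'a,b,c'.split(',');    // ['a', 'b', 'c'] — correct
const quoted = '"x,y",z'.split(','); // ['"x', 'y"', 'z'] — wrong field boundaries
```

This is why Fast Mode is opt-in: it is safe only when you know no field contains the delimiter.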
The Benchmarks: Real-World Performance
Tested with production-scale datasets on a modern laptop (16GB RAM, Chrome 120, Intel i7-12700K):
| Input Size | Rows | Mode | Time | Throughput | Memory |
|---|---|---|---|---|---|
| 10 MB | 100K | Standard | 0.3s | 333K/sec | < 200 MB |
| 50 MB | 500K | Standard | 1.5s | 333K/sec | < 400 MB |
| 100 MB | 1M | Standard | 3.0s | 333K/sec | < 600 MB |
| 500 MB | 5M | Standard | 15s | 333K/sec | < 1.2 GB |
| 1.5 GB | 15M | Dedupe ON | 67.1s | 223K/sec | < 2 GB |
Key findings:
- Consistent throughput across file sizes - Streaming architecture works as designed
- Deduplication overhead is predictable - ~30% performance cost for hash computation
- Memory stays linear - No exponential growth, no memory leaks detected
- Browser remains responsive - Can switch tabs, scroll, interact with UI throughout
- Handles files larger than RAM - 10GB files process successfully on 16GB systems
For more details on how SplitForge handles million-row CSVs in your browser, see our architecture deep-dive.
How We Compare: Desktop Tools and Cloud Solutions
We tested against major CSV processing tools using the same 15M row dataset (1.5GB file, 23 columns):
Desktop Applications:
- Excel 365: Truncates or fails at the 1,048,576-row limit (hard architectural constraint)
- Python pandas: ~45 seconds (faster but requires Python environment + pandas installation)
- SysTools CSV Splitter: Crashes on 5M+ row files, Windows-only
- Kernel CSV Splitter: Requires installation, $99 license, Windows-only
Browser Tools:
- ConvertCSV: 500K row limit, then forces upload to server
- CSV Viewer: Hangs browser at 2M rows
- Online-Convert: Requires upload, 5-10 minute server processing
Cloud Solutions:
- Gigasheet: Requires upload (network bottleneck), 5-10 minute processing
- CSVbox: 5M row limit on free tier, upload required
- Data.world: Upload required, privacy compliance concerns
SplitForge doesn't just avoid uploads—it's faster than tools that run natively on your computer (except pandas with a pre-configured Python environment). And unlike every tool listed: your data never leaves your machine.
The Privacy Advantage: Zero Trust Architecture
Because everything runs client-side per W3C File API specification, SplitForge implements true zero-trust data processing:
- No uploads: Your CSV never leaves your machine—files read locally via File API
- No servers: We can't see your data because we never receive it
- No limits: Process 100GB files if your browser can handle it (tested up to 10GB)
- Offline capable: Works without internet after initial page load (service workers cache assets)
- Audit-friendly: No data transmission = simplified GDPR/HIPAA compliance
For regulated industries (finance, healthcare, legal), this eliminates entire categories of compliance concerns:
- No data processing agreements required (no third-party processor)
- No vendor risk assessment needed
- No audit trail of data transmission
- No breach notification requirements for tool usage
The tool can't leak your data because it never has network access to it.
What This Means for Different Users
For Data Analysts
Processing capabilities Excel can't match:
- Merge multiple CSV files totaling millions of rows
- Process monthly exports without 1M row limit
- Clean datasets with deduplication in seconds
- No more "split, process chunks, manually recombine" workflows
For SMBs and Operations Teams
Enterprise capabilities without enterprise costs:
- No expensive software licenses required
- Works on any device with a modern browser
- Zero IT approval needed (no installation)
- No security risk from uploads
For Finance/Healthcare/Legal
Compliance-friendly by architecture:
- GDPR/HIPAA compliant by default (data never leaves device)
- No vendor risk assessment required
- Audit trail is local-only
- Process most sensitive datasets without exposure risk
The Technical Takeaway: Browser Architecture Patterns
Building fast browser applications requires rethinking traditional architectures. Five key patterns emerged from this project:
1. Stream everything - Don't load files into memory. Use ReadableStream API for incremental processing.
2. Use Web Workers - Keep the UI responsive by offloading CPU-intensive work to background threads.
3. Batch output - Avoid giant string allocations. Encode batches to Uint8Array and concatenate binary chunks.
4. Optimize hot paths - Custom parsers beat generic libraries for specific use cases. Profile and optimize what actually runs millions of times.
5. Test at scale - 1M rows reveals problems 100K rows don't. Performance characteristics change dramatically with file size.
People underestimate what browsers can do. The browser has evolved from a document viewer into a powerful runtime capable of serious data processing—if you architect for its strengths rather than fighting its constraints.
If you need to merge CSV files with different column structures, our guide covers advanced merge techniques including automatic column alignment and header normalization.
Performance Metrics Summary
- 15,000,000 rows processed in 67.1 seconds
- 223,200 rows/second throughput with deduplication enabled
- 26,411 duplicates removed during processing
- 100% client-side—zero uploads, zero server usage, zero network transfer
- 14× Excel's row limit handled without breaking a sweat
- < 2GB peak RAM despite processing 1.5GB source file
The browser is no longer a constraint for data processing—it's an opportunity.
Tools Referenced:
Browser APIs & Standards:
- MDN ReadableStream - Streaming file I/O
- MDN Web Workers API - Background processing
- W3C File API - Browser file handling
- FNV Hash Function - Fast hashing algorithm
Browser-Based Tools:
- CSV Merger - Client-side processing, no uploads
All browser-based tools process data entirely in your browser—no uploads, no servers, no data leaving your computer. Essential for protecting sensitive financial records, healthcare information, and confidential business data.
Building browser-based tools? Have benchmarks to share? Connect on LinkedIn or tweet results at @splitforge.