Production-Tested — February 2026

5 Million CSV Rows Joined
in 23.6 Seconds (join time, excludes optional analysis step)

212K rows/second (inner join, dataset-dependent). Hash-based O(n+m) algorithm. 7 join types, composite keys, explosion detection — all in your browser with zero uploads. Works alongside CSV Merger for full data preparation workflows, or see the full tool overview.

~333K/s
Peak Throughput
rows/sec at 1M rows
5M+
Maximum Tested
rows (~900MB combined, 32GB RAM)
Never
File Uploads
zero transmission
7
Join Types
incl. Anti & Semi

Benchmark Performance

What these numbers measure: SplitForge times are compute benchmarks — milliseconds from worker init to download prompt, averaged across 10 runs (high/low discarded). Chrome (stable), Windows 11, Intel i7-12700K, 32GB RAM, February 2026. Results vary by hardware, browser, join type, and match rate (±15–20%).

Excel 900s is a workflow estimate, not a compute benchmark — it represents a typical end-to-end VLOOKUP workflow (write formula, drag down, IFERROR wrapper, copy-paste as values, troubleshoot mismatches) from internal testing, February 2026. Actual workflow time varies by user familiarity and file complexity. These are not directly comparable numbers — the intent is to show why the tool context matters, not raw compute speed.

Performance at Scale

Chrome (stable) · Windows 11 · Intel i7-12700K · 32GB RAM · February 2026

File Size · Inner Join · Left Join · Test Notes
100K rows · ~296K rows/sec · ~280K rows/sec · Startup overhead visible at small sizes; hash map build dominates
500K rows · ~310K rows/sec · ~295K rows/sec · Mixed data types, single key column, 85% match rate
1M rows · ~333K rows/sec · ~315K rows/sec · Inner join with duplicates; 90% match rate
2M rows · ~290K rows/sec · ~270K rows/sec · Larger right-side hash table, more GC pressure
5M rows · ~212K rows/sec · ~195K rows/sec · Verified stress test — 5M left × 4.5M right, 556MB output
10M (est.) · ~208K rows/sec · ~190K rows/sec · Estimated from the scaling curve; browser memory is the constraint

Results vary by hardware, browser, match rate, and number of output columns. Throughput peaks at 1M rows, where the hash map fits comfortably in memory, then decreases at 5M due to GC pressure on larger hash tables.

Join Type Performance Overhead

Inner Join (Default)
Baseline
~212K rows/sec
Returns only rows with matches in both tables. Smallest output, least memory pressure. Hash map built from right table; left rows streamed and matched against it. Fastest join type because unmatched rows are discarded immediately.
Left Join
+8% time
~195K rows/sec
All left rows returned, matched or not. Unmatched left rows get empty values for right columns. Overhead: must write null-padded rows for non-matching left keys. This is the closest equivalent to Excel VLOOKUP — but returns ALL matches, not just the first.
Right Join
+15% time
~185K rows/sec
All right rows returned, matched or not. Requires tracking which right keys were matched during left-table streaming, then outputting unmatched right rows in a second pass. Slightly more overhead than left join due to second-pass tracking.
Full Outer Join
+20% time
~175K rows/sec
All rows from both tables, nulls where no match on either side. Combines left join logic with unmatched right row tracking. Largest possible output (sum of both tables minus inner join rows). Use with caution on large files — output can be much larger than either input.
Anti Join
~4% less time (faster than baseline)
~220K rows/sec
Left rows WITHOUT a match — no right columns assembled. Actually faster than inner join because output rows are simpler (left columns only, no right column assembly). Use to find orphaned records: customers without orders, products not in price lists, invoices without payments.
Semi Join
~6% less time (faster than baseline)
~225K rows/sec
Left rows WITH a match, but without right columns. Fastest of all join types: match confirmed in hash table, left row written as-is, right columns never assembled. Use when you need the left table filtered by the right table but don't want right column data added.
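The build-and-probe mechanics described above can be sketched in a few lines of Python (plain dicts standing in for the tool's JavaScript hash map; the function name and sample data are illustrative, not SplitForge's actual code):

```python
from collections import defaultdict

def hash_join(left, right, key, how="inner"):
    """Minimal hash join: index the right table, then stream the left table once."""
    # Build phase: group right rows by join key (lists handle duplicate keys).
    index = defaultdict(list)
    for row in right:
        index[row[key]].append(row)

    out = []
    # Probe phase: each left row is a single O(1) hash lookup.
    for row in left:
        matches = index.get(row[key])
        if matches:
            if how == "anti":
                continue                  # anti join keeps only non-matches
            elif how == "semi":
                out.append(dict(row))     # semi join: left columns only, once
            else:                         # inner / left: one row per match
                for m in matches:
                    merged = dict(row)
                    merged.update({k: v for k, v in m.items() if k != key})
                    out.append(merged)
        elif how in ("left", "anti"):
            out.append(dict(row))         # the real tool null-pads right columns here
    return out

orders = [{"id": 1, "sku": "A"}, {"id": 2, "sku": "B"}]
prices = [{"id": 1, "price": 10}, {"id": 1, "price": 12}]
```

With that data, the duplicate right key produces two inner-join rows for order 1, which is exactly the "all matches, not just the first" behavior that distinguishes a true join from VLOOKUP.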
Cross Join warning: Cross join generates a Cartesian product — every left row × every right row. 1,000 × 1,000 rows = 1,000,000 output rows. 5,000 × 5,000 = 25,000,000 rows. SplitForge hard-caps Cross join output at 100,000,000 rows. Pre-join analysis shows estimated output count before you commit. Always run analysis first for Cross joins.
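The explosion math is easy to check before committing. A sketch of the kind of pre-flight estimate the analysis step performs (the function name is illustrative):

```python
CROSS_JOIN_CAP = 100_000_000  # SplitForge's hard cap on cross join output rows

def estimate_cross_join(n_left, n_right, cap=CROSS_JOIN_CAP):
    """Cartesian product size check: estimated rows, and whether the cap allows it."""
    estimated = n_left * n_right
    return estimated, estimated <= cap

estimate_cross_join(5_000, 5_000)    # 25 million rows -- under the cap
estimate_cross_join(20_000, 20_000)  # 400 million rows -- blocked
```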
All overhead figures measured on 1M row datasets, February 2026, Chrome (stable), 32GB RAM, Intel i7-12700K. Results vary by hardware, browser, match rate, and file complexity.

Calculate Your Time Savings

Manual baseline: ~15 minutes per join operation via Excel VLOOKUP — based on internal workflow testing, February 2026. This covers: write VLOOKUP formula, drag down all rows, wrap in IFERROR, copy-paste as values to remove formula dependency, troubleshoot mismatches and #N/A errors, repeat for each join column. SplitForge completes the equivalent join in under 60 seconds including the pre-analysis step, and returns all duplicate matches (not just the first).

Typical: 1–4 joins per data prep session

Sessions per year: Weekly = 52 · Monthly = 12 · Daily = 260

Analyst avg: $45–75/hr

Annual Time Saved
25.1
hours per year
Annual Labor Savings
$1,257
per year (vs VLOOKUP workflow)
What you eliminate:
  • Writing and dragging VLOOKUP formulas across hundreds of thousands of rows
  • IFERROR wrappers and #N/A troubleshooting
  • Copy-paste as values to remove formula dependency before sharing
  • Missed duplicate matches that corrupt aggregations downstream
  • Excel's hard 1,048,576-row limit, which blocks or truncates larger files entirely
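The headline figures above fall out of straightforward arithmetic. One set of assumptions that reproduces them, a sketch rather than the site's actual calculator code (2 joins per weekly session, ~30 seconds per SplitForge join, $50/hr as a mid-range analyst rate):

```python
manual_min = 15.0        # Excel VLOOKUP workflow per join (internal estimate)
tool_min = 0.5           # SplitForge join incl. pre-analysis step
joins_per_year = 52 * 2  # weekly sessions, 2 joins each (mid of the 1-4 range)
rate_per_hour = 50.0     # within the $45-75/hr analyst range

hours_saved = (manual_min - tool_min) * joins_per_year / 60
dollars_saved = hours_saved * rate_per_hour
print(round(hours_saved, 1), round(dollars_saved))  # 25.1 1257
```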

Testing Methodology

10 runs per config · drop high/low · report avg + range · test datasets available on request
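The drop-high/low averaging is a simple trimmed mean; a sketch of the reporting step (the function name is illustrative):

```python
def report_runs(times_ms):
    """Drop the single fastest and slowest run, average the rest, report the range."""
    trimmed = sorted(times_ms)[1:-1]
    avg = sum(trimmed) / len(trimmed)
    return avg, (min(times_ms), max(times_ms))  # reported as avg + range
```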


Honest Limitations: Where SplitForge VLOOKUP/Join Falls Short

No tool is perfect for every use case. Here's where server-side join tools (SQL databases, Python pandas, AWS Glue) might be a better choice, and the real limitations of our browser-based architecture.

Browser-Based Processing

Performance depends on your device's RAM and CPU. Recent laptops (2022+) handle the tested 5M-row workloads comfortably, but older devices may struggle with very large files.

Workaround:
Close unnecessary browser tabs to free up memory. For files over 50M rows, consider database solutions.

No Offline Mode (Initial Load)

Requires internet connection to load the tool initially. Processing happens offline in your browser after loading.

Workaround:
Once loaded, you can disconnect and continue processing. For true offline environments, desktop tools may be better.

Browser Tab Memory Limits

Most browsers limit individual tabs to 2–4GB RAM. This is the practical ceiling for file size.

Workaround:
Use 64-bit browsers with sufficient RAM. Chrome and Firefox handle large files best.

Browser Memory Ceiling (Scales with Right-File Size)

The right-side file is fully loaded into a JavaScript hash map. Memory usage depends heavily on column count and string lengths — roughly 200–400MB for a 1M row file, 800MB–1.5GB for a 5M row file (typical business data, 8–12 columns). On 16GB machines with other browser tabs open, you may hit limits well below 5M rows.

Workaround:
Split the right file into chunks, join each chunk against the left file separately, then merge results using CSV Merger. For 50M+ row reference tables, use Python pandas merge() or load data into a database and use SQL JOIN.
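The chunked workaround looks like this in pandas terms, a sketch assuming an inner join (`chunked_right_join` is an illustrative name, not a SplitForge or pandas function):

```python
import pandas as pd

def chunked_right_join(left_df, right_df, key, chunk_rows):
    """Inner-join against the right table in slices to bound hash-map memory.

    Correct as-is for inner joins; a left join would need a final pass to
    re-add left rows unmatched across *all* chunks.
    """
    pieces = []
    for start in range(0, len(right_df), chunk_rows):
        chunk = right_df.iloc[start:start + chunk_rows]
        pieces.append(left_df.merge(chunk, on=key, how="inner"))
    return pd.concat(pieces, ignore_index=True)
```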

No Fuzzy or Approximate Matching

Keys must match exactly (or case-insensitively if toggled). No Levenshtein distance, phonetic matching, or pattern-based matching. 'Smith' and 'Smyth' will not match.

Workaround:
For fuzzy matching, use Python libraries: recordlinkage, fuzzymatcher, or thefuzz. For name standardization before joining, use SplitForge Data Cleaner to normalize casing and spacing first.
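To make the gap concrete, even the crudest approximate matcher catches what exact matching cannot. A minimal sketch using the standard library's `difflib` (a stand-in for the dedicated libraries named above; the function name and threshold are illustrative):

```python
from difflib import SequenceMatcher

def is_close_match(a, b, threshold=0.8):
    """Approximate string match via similarity ratio; exact matching would fail here."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

is_close_match("Smith", "Smyth")  # True  (ratio 0.8)
is_close_match("Smith", "Jones")  # False
```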

No Automation or API Support

SplitForge is a browser tool — no REST API, CLI, or pipeline integration. Cannot be embedded in ETL workflows or scheduled jobs.

Workaround:
For automation, use Python pandas: df_left.merge(df_right, on='key', how='inner'). For cloud pipelines, AWS Glue, dbt, or any SQL database handle joins at scale with full orchestration.
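Expanded into a runnable sketch (in-memory frames stand in for the CSV reads; column names are illustrative):

```python
import pandas as pd

# In-memory stand-ins for pd.read_csv("left.csv") / pd.read_csv("right.csv")
df_left = pd.DataFrame({"key": [1, 2, 4], "name": ["a", "b", "c"]})
df_right = pd.DataFrame({"key": [1, 2, 3], "score": [10, 20, 30]})

# Equivalent of a SplitForge inner join; how= also accepts "left", "right", "outer"
inner = df_left.merge(df_right, on="key", how="inner")

# indicator=True enables anti-join-style filtering (rows only in the left file)
tagged = df_left.merge(df_right, on="key", how="left", indicator=True)
anti = tagged[tagged["_merge"] == "left_only"].drop(columns="_merge")
```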

Join Key Column Names Must Match

The join key column must have identical names in both files. If your files use 'customer_id' and 'CustomerID' for the same concept, you must rename one before uploading.

Workaround:
Rename columns before uploading using SplitForge Column Operations tool. Column name mapping UI is on the roadmap for a future release.
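For reference, pandas sidesteps this restriction either way; a sketch with illustrative column names (SplitForge itself still requires the rename first):

```python
import pandas as pd

left = pd.DataFrame({"customer_id": [1, 2], "name": ["Ann", "Bo"]})
right = pd.DataFrame({"CustomerID": [1, 2], "tier": ["gold", "silver"]})

# Option 1: rename to a common key (what you'd do before uploading to SplitForge)
right_renamed = right.rename(columns={"CustomerID": "customer_id"})
merged = left.merge(right_renamed, on="customer_id")

# Option 2: pandas can join on differently named keys directly
merged2 = left.merge(right, left_on="customer_id", right_on="CustomerID")
```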

When to Use Server-Side Join Tools (SQL Databases, Python pandas, AWS Glue) Instead

You need joins in an automated ETL or scheduled pipeline

SplitForge has no API; a browser-only workflow cannot run on a schedule or be triggered programmatically.

💡 Python pandas df.merge(), dbt models with SQL JOIN, or AWS Glue transformation jobs.

You need fuzzy or approximate key matching

SplitForge only supports exact (or case-insensitive) matching. Complex match patterns require fuzzy logic.

💡 Python recordlinkage, fuzzymatcher, or PostgreSQL pg_trgm extension for trigram-based fuzzy joins.

You need to join 50M+ row files regularly

Browser memory limits the practical ceiling to roughly 5M right-side rows. Server-side tools scale horizontally.

💡 PostgreSQL, DuckDB, Snowflake, or BigQuery for large-scale joins. All support standard SQL JOIN syntax.

You need team-shared, version-controlled join configurations

SplitForge join settings aren't saved or shareable — each user configures from scratch each session.

💡 dbt models for SQL joins, or a shared Python script in a team repository.

Questions about limitations? Check our FAQ section below or contact us via the feedback button.

Frequently Asked Questions

How accurate is the 212K rows/second benchmark?

Why does throughput decrease at 5M rows vs 1M rows?

How does the hash-based algorithm compare to Excel VLOOKUP?

What is the difference between join types and their performance impact?

What is the pre-join analysis step and does it affect performance?

What happens with composite key joins (multi-column)?

Can I reproduce these benchmarks?

What is the browser memory limit for joins?

Benchmarks last updated: February 2026. Planned for re-testing after major algorithm changes.

Ready to Join 5M Rows in 23 Seconds?

No installation. Files never uploaded. 7 join types, composite keys, explosion detection — drop your CSVs and see the result in seconds.