Internally Verified Benchmarks

10 Million Rows Profiled in 107 Seconds

Verified on Chrome 131, Windows 11, 16GB RAM, Intel i7-12700K. All 11 analysis types running simultaneously. 100% mathematical accuracy confirmed.

Results vary by hardware, browser, and file complexity.

235K rows/sec · Peak Throughput (94K row file)
93K rows/sec · Steady-State Speed (5M–10M row files)
1.26 GB · Max File Tested (10M row file)
100% · Statistical Accuracy (verified vs NumPy/SciPy)
10 Million Rows
106.83 seconds · 150M cells · 11 analysis types · 100% accurate
Throughput: 93K rows/sec
File Size: 1.26 GB
Statistical Accuracy: 100%
vs Excel Row Limit: 9.5×

Performance Across Dataset Sizes

Chrome 131, Windows 11, 16GB RAM, Intel i7-12700K · Results vary by hardware, browser, and file complexity.

Dataset | Description | Time | Rows/sec | File Size | Status
94K rows | Small file baseline | 0.4s | 235K/s | ~9 MB | Tested
1.5M rows | Mid-scale dataset | 14.85s | 101K/s | ~145 MB | Tested
5M rows | Large-scale analysis | 53.45s | 94K/s | ~485 MB | Verified
10M rows | Maximum verified capacity | 106.83s | 94K/s | 1.26 GB | Verified

Scaling Visualized

Chrome 131, Windows 11, 16GB RAM, Intel i7-12700K. Throughput stabilizes after ~1.5M rows as parsing and analysis overhead converges.

Processing Time (seconds)

Elapsed time grows near-linearly with row count

[Chart: dataset size (94K–10M rows) vs. processing time, 0–120s]

Throughput (K rows/sec)

Stabilizes ~93–101K/s after initial parse overhead

[Chart: dataset size (94K–10M rows) vs. throughput, 0–260K rows/sec; steady-state ~93K/s]

When Performance Degrades

Our benchmarks use mixed-type files representing typical real-world exports. Specific data characteristics will push processing time above or below these numbers. Here's what to expect.

High Cardinality Columns · Moderate slowdown

Columns with millions of unique string values increase memory allocation for distinct-value counting. A 10M-row file where every row has a unique UUID in a string column will profile ~15% slower than a numeric-only file.

Wide Files (100+ Columns) · Significant slowdown

Each additional column adds type detection, statistics, and histogram passes. A 1M-row file with 200 columns will take roughly 3–4× longer than a 1M-row file with 15 columns. Column count matters more than row count at scale.

String-Heavy Datasets · Moderate slowdown

Numeric columns process faster than string columns because type inference and statistical calculations are cheaper. Free-text columns (product descriptions, notes fields) add overhead due to whitespace detection and cardinality analysis.

Available Browser Memory · Hard ceiling

Browser memory is the practical ceiling. The profiler streams data in chunks to stay within limits, but very large files on machines with <8GB RAM may trigger garbage collection pauses, increasing elapsed time by 20–40%.
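The bounded-memory approach can be illustrated with Welford's online algorithm: running count, mean, and variance are updated chunk by chunk in O(1) memory per column, so the whole file never needs to be resident at once. This is a sketch, not the profiler's actual implementation; `RunningStats` and `profileChunks` are hypothetical names.

```typescript
// Running statistics via Welford's algorithm: one pass, O(1) state per column,
// so chunked streaming keeps memory bounded regardless of file size.
class RunningStats {
  private n = 0;
  private mean = 0;
  private m2 = 0; // running sum of squared deviations from the mean

  push(x: number): void {
    this.n++;
    const delta = x - this.mean;
    this.mean += delta / this.n;
    this.m2 += delta * (x - this.mean);
  }

  get count(): number { return this.n; }
  get average(): number { return this.mean; }
  get variance(): number { return this.n > 1 ? this.m2 / (this.n - 1) : 0; }
  get stdDev(): number { return Math.sqrt(this.variance); }
}

// Feed values chunk by chunk, as a streaming parser would deliver them.
function profileChunks(chunks: number[][]): RunningStats {
  const stats = new RunningStats();
  for (const chunk of chunks) for (const v of chunk) stats.push(v);
  return stats;
}
```

Because only three numbers of state survive between chunks, garbage-collection pressure comes from chunk parsing rather than from the statistics themselves.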

Quote-Aware Parsing · Minor slowdown

Files with quoted fields containing embedded delimiters require RFC 4180 parsing (~15% slower than simple split). Most real-world CSVs export with some quoted fields, so our benchmark files include a representative mix.
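A minimal quote-aware field splitter, for illustration, looks like this. It is a sketch of RFC 4180 handling for a single record; `parseCsvRecord` is a hypothetical name, and a real parser must also handle newlines inside quoted fields, which is part of why it costs more than a simple split.

```typescript
// RFC 4180-style field splitting for one record: handles quoted fields,
// embedded delimiters, and escaped quotes ("" inside a quoted field).
function parseCsvRecord(line: string, delimiter = ","): string[] {
  const fields: string[] = [];
  let field = "";
  let inQuotes = false;
  for (let i = 0; i < line.length; i++) {
    const ch = line[i];
    if (inQuotes) {
      if (ch === '"') {
        if (line[i + 1] === '"') { field += '"'; i++; } // escaped quote
        else inQuotes = false;                          // closing quote
      } else field += ch;
    } else if (ch === '"') inQuotes = true;
    else if (ch === delimiter) { fields.push(field); field = ""; }
    else field += ch;
  }
  fields.push(field);
  return fields;
}
```

The per-character branching above is what a naive `line.split(",")` avoids, and why quote-aware parsing carries a measurable cost.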

All 11 Analysis Types

Every type runs in a single pass — no re-scanning the file.

Automatic Type Detection
15 data types inferred automatically — numbers, dates (7 formats), strings, booleans.
90–100% confidence score per column. No schema definition required.
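One common way to implement this kind of inference, sketched here with hypothetical names and only four matchers (the profiler's full 15-type detector is not shown), is to test sampled values against candidate parsers and use the match rate as the confidence score:

```typescript
type Inferred = { type: string; confidence: number };

// Illustrative matchers only; a full detector would cover many more types
// and multiple date formats.
const MATCHERS: [string, (v: string) => boolean][] = [
  ["boolean", v => /^(true|false)$/i.test(v)],
  ["integer", v => /^-?\d+$/.test(v)],
  ["float",   v => /^-?\d+\.\d+$/.test(v)],
  ["date",    v => /^\d{4}-\d{2}-\d{2}$/.test(v)], // one of several formats
];

function inferColumnType(values: string[]): Inferred {
  let best: Inferred = { type: "string", confidence: 1 }; // string always matches
  for (const [type, matches] of MATCHERS) {
    const hits = values.filter(matches).length / values.length;
    // Prefer a specific type only when nearly all sampled values match it.
    if (hits >= 0.9 && (best.type === "string" || hits > best.confidence)) {
      best = { type, confidence: hits };
    }
  }
  return best;
}
```

A 90% match threshold lets a mostly-numeric column with a few stray values still be detected as numeric, which is where sub-100% confidence scores come from.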
Descriptive Statistics
Mean, median, mode, Q1/Q3, IQR, variance, and standard deviation.
Full-precision arithmetic verified against NumPy/SciPy. No floating-point approximations.
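For illustration, these statistics can be computed as below. This is a sketch, not the profiler's API; it uses linear interpolation between closest ranks for quantiles, which matches NumPy's default method.

```typescript
// Summary statistics on a sorted copy of the column's numeric values.
function describe(values: number[]) {
  const sorted = [...values].sort((a, b) => a - b);
  const n = sorted.length;
  const quantile = (q: number): number => {
    // Linear interpolation between closest ranks (NumPy's default).
    const pos = (n - 1) * q;
    const lo = Math.floor(pos), hi = Math.ceil(pos);
    return sorted[lo] + (sorted[hi] - sorted[lo]) * (pos - lo);
  };
  const mean = sorted.reduce((s, v) => s + v, 0) / n;
  const variance = sorted.reduce((s, v) => s + (v - mean) ** 2, 0) / (n - 1);
  const q1 = quantile(0.25), q3 = quantile(0.75);
  return {
    mean,
    median: quantile(0.5),
    q1,
    q3,
    iqr: q3 - q1,
    variance,                      // sample variance (n - 1 denominator)
    stdDev: Math.sqrt(variance),
  };
}
```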
Value Histograms
Real value distributions from actual data — not synthetic estimates.
Shows your data's true shape. Calculated from 100% of values on files under the sample threshold.
Quality Issue Detection
Six automated checks: high nulls, outliers (IQR method), whitespace, duplicates, constant columns, near-empty columns.
Each issue type is reported with counts, percentages, and sample values for review.
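The IQR outlier check flags values outside [Q1 − 1.5·IQR, Q3 + 1.5·IQR]. A minimal sketch (hypothetical `iqrOutliers` helper, using the same interpolated-quantile convention as NumPy's default):

```typescript
// Flag values outside the standard 1.5 * IQR fences.
function iqrOutliers(values: number[]): number[] {
  const sorted = [...values].sort((a, b) => a - b);
  const q = (p: number): number => {
    const pos = (sorted.length - 1) * p;
    const lo = Math.floor(pos);
    return sorted[lo] + (sorted[Math.ceil(pos)] - sorted[lo]) * (pos - lo);
  };
  const q1 = q(0.25), q3 = q(0.75), iqr = q3 - q1;
  const lower = q1 - 1.5 * iqr, upper = q3 + 1.5 * iqr;
  return values.filter(v => v < lower || v > upper);
}
```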
Cardinality Analysis
Distinct value counts, uniqueness percentages, and primary key identification.
Columns at 95%+ uniqueness are flagged as potential primary keys. Calculated in a single streaming pass.
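A single-pass uniqueness check can be sketched with a Set (hypothetical `cardinality` helper; a real streaming implementation on very large columns might use approximate distinct counting instead):

```typescript
// Distinct count, uniqueness ratio, and the 95% primary-key flag.
function cardinality(values: string[]) {
  const distinct = new Set(values).size;
  const uniqueness = distinct / values.length;
  return { distinct, uniqueness, likelyPrimaryKey: uniqueness >= 0.95 };
}
```

The threshold is below 100% deliberately, so ID columns with a handful of duplicated or null rows are still surfaced for review.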
Top Values Frequency
Most common values with counts and percentages across every column.
Top 10 values by frequency. Useful for spotting dominant categories, encoding issues, or suspicious uniformity.
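Frequency counting is a single Map pass followed by a sort (hypothetical `topValues` helper):

```typescript
// Count every value once, then keep the most frequent entries.
function topValues(values: string[], limit = 10) {
  const counts = new Map<string, number>();
  for (const v of values) counts.set(v, (counts.get(v) ?? 0) + 1);
  return [...counts.entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, limit)
    .map(([value, count]) => ({ value, count, pct: (100 * count) / values.length }));
}
```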
Cross-Column Insights
Three relationship checks: duplicate row detection, correlated null patterns, candidate foreign keys.
Duplicate detection uses FNV-1a hashing. Foreign key candidates identified by value overlap percentage.
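FNV-1a is a tiny XOR-then-multiply hash. A sketch of hash-based duplicate detection follows (helper names are hypothetical, and a production version would confirm hash matches against the actual rows to rule out collisions):

```typescript
// FNV-1a 32-bit hash over a string.
function fnv1a(s: string): number {
  let hash = 0x811c9dc5;                // FNV offset basis
  for (let i = 0; i < s.length; i++) {
    hash ^= s.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // FNV prime, 32-bit multiply
  }
  return hash >>> 0;
}

// Hash each serialized row; report indices whose hash was seen before.
function duplicateRows(rows: string[][]): number[] {
  const seen = new Map<number, number>(); // hash -> first row index
  const dups: number[] = [];
  rows.forEach((row, i) => {
    const h = fnv1a(row.join("\u0000")); // NUL delimiter, unlikely in data
    if (seen.has(h)) dups.push(i);
    else seen.set(h, i);
  });
  return dups;
}
```

Hashing keeps memory at one 32-bit value per distinct row instead of storing full row contents.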
Pearson Correlation
Pearson r coefficients across all numeric column pairs.
Classified by strength: weak (<0.5), moderate (0.5–0.7), strong (>0.7). Top 10 significant pairs shown.
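Pearson r and the strength buckets above can be sketched as follows (helper names are hypothetical):

```typescript
// Pearson correlation coefficient for two equal-length numeric columns.
function pearson(x: number[], y: number[]): number {
  const n = x.length;
  const mx = x.reduce((s, v) => s + v, 0) / n;
  const my = y.reduce((s, v) => s + v, 0) / n;
  let num = 0, dx = 0, dy = 0;
  for (let i = 0; i < n; i++) {
    num += (x[i] - mx) * (y[i] - my);
    dx += (x[i] - mx) ** 2;
    dy += (y[i] - my) ** 2;
  }
  return num / Math.sqrt(dx * dy);
}

// Strength buckets as used in the report.
function strength(r: number): string {
  const a = Math.abs(r);
  return a > 0.7 ? "strong" : a >= 0.5 ? "moderate" : "weak";
}
```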
ML Anomaly Detection
Isolation Forest algorithm identifies statistically anomalous rows across all numeric columns.
Anomaly score 0–1. Moderate >0.65, High >0.7, Critical >0.8. No labelled training data required.
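A rough, deterministic sketch of Isolation Forest scoring follows, simplified from Liu et al.'s algorithm: random subsamples, random axis-aligned splits, and score 2^(−E[h(x)]/c(n)). All names are hypothetical, the seeded LCG exists only for reproducibility, and sampling is with replacement for simplicity.

```typescript
// Deterministic LCG so tree building is reproducible in this sketch.
function lcg(seed: number): () => number {
  let s = seed >>> 0;
  return () => {
    s = (Math.imul(s, 1664525) + 1013904223) >>> 0;
    return s / 4294967296;
  };
}

type Tree = { split?: number; attr?: number; left?: Tree; right?: Tree; size: number };

function buildTree(data: number[][], rand: () => number, depth: number, maxDepth: number): Tree {
  if (depth >= maxDepth || data.length <= 1) return { size: data.length };
  const attr = Math.floor(rand() * data[0].length);
  const vals = data.map(r => r[attr]);
  const min = Math.min(...vals), max = Math.max(...vals);
  if (min === max) return { size: data.length };
  const split = min + rand() * (max - min); // random axis-aligned cut
  return {
    split, attr,
    left: buildTree(data.filter(r => r[attr] < split), rand, depth + 1, maxDepth),
    right: buildTree(data.filter(r => r[attr] >= split), rand, depth + 1, maxDepth),
    size: data.length,
  };
}

// c(n): average path length of an unsuccessful BST search; normalizes scores.
function avgPath(n: number): number {
  if (n <= 1) return 0;
  return 2 * (Math.log(n - 1) + 0.5772156649) - (2 * (n - 1)) / n;
}

function pathLength(row: number[], tree: Tree, depth = 0): number {
  if (tree.split === undefined) return depth + avgPath(tree.size);
  return pathLength(row, row[tree.attr!] < tree.split ? tree.left! : tree.right!, depth + 1);
}

// Score each row: outliers isolate in few splits, so their paths are short
// and their scores approach 1.
function anomalyScores(data: number[][], nTrees = 100, sampleSize = 64): number[] {
  const rand = lcg(42);
  const m = Math.min(sampleSize, data.length);
  const trees: Tree[] = [];
  for (let t = 0; t < nTrees; t++) {
    const sample: number[][] = [];
    for (let i = 0; i < m; i++) sample.push(data[Math.floor(rand() * data.length)]);
    trees.push(buildTree(sample, rand, 0, Math.ceil(Math.log2(sampleSize))));
  }
  const c = avgPath(m);
  return data.map(row => {
    const meanPath = trees.reduce((s, tr) => s + pathLength(row, tr), 0) / trees.length;
    return Math.pow(2, -meanPath / c); // in (0, 1]; higher = more anomalous
  });
}
```

No labels are needed because isolation depth alone separates rare points from dense clusters.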
Time Series Patterns
Trend detection (increasing / decreasing / stable), data frequency, and gap analysis for date columns.
Auto-detects daily, weekly, monthly patterns. Reports gaps with start/end dates and day counts.
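Gap detection for a daily date column can be sketched as below (hypothetical `findDailyGaps` helper, assuming ISO `YYYY-MM-DD` strings, which `Date.parse` treats as UTC midnight):

```typescript
type Gap = { start: string; end: string; missingDays: number };

// Sort parsed timestamps, then report any stretch where consecutive
// dates are more than one day apart.
function findDailyGaps(isoDates: string[]): Gap[] {
  const DAY = 86_400_000; // ms per day
  const days = isoDates.map(d => Date.parse(d)).sort((a, b) => a - b);
  const gaps: Gap[] = [];
  for (let i = 1; i < days.length; i++) {
    const diff = Math.round((days[i] - days[i - 1]) / DAY);
    if (diff > 1) {
      gaps.push({
        start: new Date(days[i - 1] + DAY).toISOString().slice(0, 10),
        end: new Date(days[i] - DAY).toISOString().slice(0, 10),
        missingDays: diff - 1,
      });
    }
  }
  return gaps;
}
```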
Client-Side Architecture
All 11 analysis types run in Web Workers inside your browser. Zero server communication.
Designed to avoid server transmission of PHI or PII. No network requests made during file processing. Compliance with your organization's specific data handling policies remains your responsibility.

Mathematically Verified Accuracy

Statistics were verified against closed-form formulas. For a sequential ID column (1 to 10,000,000), the correct sum is n(n+1)/2. The profiler produces the exact value — no rounding, no approximation.

Expected Sum, n(n+1)/2: 50,000,005,000,000
Profiler Sum: 50,000,005,000,000
Expected Mean: 5,000,000.5
Profiler Mean: 5,000,000.5
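This check is straightforward to reproduce yourself: sum 1 through 10,000,000 by iteration and compare with the closed form. Every intermediate value stays below 2^53, so IEEE-754 doubles represent each result exactly and the comparison can use strict equality.

```typescript
// Sum 1..n iteratively and compare against the closed form n(n+1)/2.
// All values stay under 2^53, so float64 arithmetic here is exact.
const n = 10_000_000;
let sum = 0;
for (let i = 1; i <= n; i++) sum += i;
const closedForm = (n * (n + 1)) / 2;
const mean = sum / n;
```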

Why Client-Side Processing Matters for Performance

Cloud profiling tools are bottlenecked by upload time before processing even starts. A 500MB file at 100 Mbps takes 40 seconds to upload — before analysis begins. SplitForge reads directly from your local disk via the browser FileReader API. No network round-trip. The clock starts immediately.

Web Workers offload all analysis to a background thread, keeping the UI responsive during long profiling runs. The streaming architecture processes data in chunks — memory usage stays bounded regardless of file size, which is why 1.26 GB files complete without crashing the tab.

Benchmark methodology: All tests run on Chrome 131 (stable), Windows 11, Intel i7-12700K, 16GB DDR4 RAM. Files contained mixed data types (integers, floats, strings, dates) to reflect real-world conditions. Each benchmark was run three times; times reported are the median. Results vary by hardware, browser version, available RAM, and file data complexity. Your results may differ.
Try Data Profiler Now

Free · No account required · Designed to avoid server transmission of PHI · Files never leave your browser

Benchmarks run February 2026 · SplitForge v2.1 · Chrome 131, Windows 11, Intel i7-12700K, 16GB DDR4 RAM
Median of 3 runs per dataset size · Mixed data types (integers, floats, strings, dates)
Next scheduled re-benchmark: May 15, 2026