
What's a Token? Why Your Spreadsheet Is 'Too Big' for AI (2026)

May 28, 2026

By SplitForge Team

A token is roughly four characters of text — the unit AI models use to read and process your input. When a spreadsheet is "too big" for an AI tool, the cause is one of two separate limits: the upload-size cap (a file-size test the platform runs before reading your data) or the context-window limit (a token-count test the model runs as it processes your data). These are independent constraints, and nearly everyone conflates them.

TL;DR: A token is approximately four characters; a typical structured CSV row is 50–150 tokens — always an estimate, never a precise figure. GPT-4o's 128,000-token context window controls how much of your spreadsheet the model can reason about in one turn, and it is a separate constraint from ChatGPT's ~50MB upload cap. If your file exceeds either limit, splitting by rows before uploading resolves both.

Split your CSV to fit →

You upload a 50,000-row export to ChatGPT, the upload succeeds, and you ask it to summarize the data. When you check how many rows it processed, ChatGPT reports roughly 1,300. At about 100 tokens per row, that is all that fits in the model's 128,000-token context window. The remaining rows were inside the upload size cap but outside the context window, a limit that has nothing to do with file size. That gap between "will it upload" and "can it actually read all of it" is what this post explains.

What Is a Token?
Why Spreadsheets Are Token-Heavy
The Two Reasons Your File Is 'Too Big'
How Many Rows Fit in the 128K Context Window?
Verify Your Exact Token Count in 60 Seconds
How to Tell If Your File Will Fit
What to Do About It
Additional Resources
FAQ

What Is a Token?

A token is the smallest unit of text an AI language model processes when it reads input. OpenAI's published rule of thumb is approximately 1 token per 4 characters of English text, or roughly 0.75 words per token. The word "spreadsheet" is about 2 tokens; the phrase "customer ID" is about 2 or 3 depending on how the tokenizer splits it; a date like "2026-01-15" usually costs more than its 10 characters suggest, often 4 to 6 tokens, because digits and hyphens split into separate pieces.

Tokenization is not word-splitting. Models use a subword encoding algorithm that compresses common words into single tokens and splits uncommon sequences — structured identifiers, dates, and numeric codes — into multiple pieces. "January" is typically one token; "01/15/2026" in a date column might tokenize into four or five, because the exact character sequence is not a unit the model has learned to compress efficiently.

The practical implication: every character in your CSV costs tokens, not just the readable words. Cell values, comma delimiters, quotation marks around fields that contain commas, and the newline at the end of each row all count — which is why structured data behaves very differently from prose text of the same character count.
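
As a rough illustration of that point, the sketch below applies the four-characters-per-token rule to one CSV row, counting every character including commas, quotes, and the trailing newline. The function name `estimate_tokens` and the sample row are hypothetical, and the divide-by-four rule is only a rule of thumb for English text, so a real tokenizer will often report more tokens for structured rows like this one.

```python
import math

CHARS_PER_TOKEN = 4  # rule of thumb for English text; an approximation, not a tokenizer

def estimate_tokens(text: str) -> int:
    """Rough token estimate: every character counts, delimiters and newlines included."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)

# Hypothetical row: ID, date, quoted name containing a comma, amount, newline.
row = 'CUST-00892471,2026-01-15,"Smith, Jane",149.99\n'
print(len(row), "characters ->", estimate_tokens(row), "tokens (estimate)")
```

Treat the result as a floor for planning purposes; an exact count requires running the text through a tokenizer.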

Why Spreadsheets Are Token-Heavy

Spreadsheets use more tokens per character than prose text because their structure — repeated column patterns, numeric identifiers, dates, and category strings — resists the compression that makes common English words token-efficient. Every cell value, every comma delimiter, every quoted field boundary, and every row-ending newline in a CSV file consumes tokens. That overhead repeats identically across every row, and it compounds at scale: what looks like a moderate file size can represent an enormous token budget.

The column headers are the one efficient part: written once and amortized across all rows. Everything else scales linearly. A customer ID like "CUST-00892471" might tokenize into five or six tokens on its own — multiply that by your row count and a single column's contribution to your token budget becomes significant.

For context: a 5,000-word article (roughly 35,000 characters) is approximately 8,750 tokens. A 5,000-row CSV with 8 columns and mixed data might be 400,000–750,000 tokens, which works out to roughly 80–150 tokens per row. The CSV is a much larger body of text to begin with, and its identifiers, dates, and delimiters also tend to cost more tokens per character than prose, because structured data resists the compression that natural language tokenizers are optimized for. This is why the same row count can represent very different token totals depending on what the data contains.

The Two Reasons Your File Is 'Too Big'

A spreadsheet is "too big" for AI for one of two distinct reasons, and they operate at different stages. The first is the upload-size cap: the file must be small enough in bytes to transfer to the platform before any processing begins. The second is the context-window limit: even after a successful upload, the model can only actively reason about as many tokens as fit in its context window — 128,000 for GPT-4o. The upload cap is a file-size test; the context window is a content test; both can fail independently of each other.

The Upload-Size Cap

The upload cap is a byte limit — it applies before the AI reads a single character of your data. ChatGPT's Data Analysis tool caps CSV and spreadsheet uploads at approximately 50MB per file. A file that exceeds this is rejected at the upload step, before the model is involved at all.

How many rows 50MB holds depends entirely on column count and cell length — a narrow numeric export can fit 300,000+ rows in 50MB, while a wide CRM export with long text fields might cap out at 30,000. For the full file-size math and a rows-per-type breakdown, see How Many Rows Can ChatGPT Handle?.

The Context-Window Limit

The context window is a token limit — it controls how much of your data the model can hold in active memory and reason about in a single conversation turn. GPT-4o's context window is 128,000 tokens. A file can upload successfully and still exceed the context window if its tokenized content exceeds that cap.

This is the limit behind "silent truncation": ChatGPT accepts the file, begins analysis, but processes only the rows that fit within the 128K window — and often responds as if it analyzed everything. The upload cap produces an error; the context window often produces no error at all, just a result based on fewer rows than you sent.

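The two limits are independent. A 10MB file that is token-dense, such as a CSV with long free-text notes in every row, can upload successfully but overflow the context window. In the other direction, a 60MB Excel workbook whose size is mostly formatting overhead fails the upload cap even though the data itself, once exported to CSV, may be small enough to fit the context window. Compressing or zipping a file does not change its token count; only removing rows, columns, or characters does.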
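
A minimal sketch of checking the two limits separately is shown below. The constant values come from the figures cited in this post; treating "~50MB" as 50 × 1024 × 1024 bytes is an assumption, and `check_limits` is a hypothetical helper name rather than part of any tool.

```python
UPLOAD_CAP_BYTES = 50 * 1024 * 1024  # ~50MB upload cap (assumed to mean 50 MiB)
CONTEXT_WINDOW_TOKENS = 128_000      # GPT-4o context window cited above

def check_limits(file_size_bytes: int, estimated_tokens: int) -> dict:
    """Evaluate each limit on its own; a file can pass one and fail the other."""
    return {
        "fits_upload_cap": file_size_bytes <= UPLOAD_CAP_BYTES,
        "fits_context_window": estimated_tokens <= CONTEXT_WINDOW_TOKENS,
    }

# A 10MB notes-heavy CSV with an illustrative 2.5M-token estimate:
# it passes the byte test and fails the token test.
print(check_limits(10 * 1024 * 1024, 2_500_000))
```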

How Many Rows Fit in the 128K Context Window?

The number of rows that fit in GPT-4o's 128,000-token context window depends entirely on your data's token density: column count, value length, and data type. The table below maps common data shapes to an estimated tokens-per-row and the approximate row count that fills the 128K window. Each row count is simply 128,000 divided by the tokens-per-row estimate, with no allowance for your prompt or the model's reply. All figures are estimates built around the ~50–150 tokens/row heuristic; actual token usage varies by encoding and content, and the right approach is to test with a smaller chunk and verify ChatGPT's reported row count before scaling up.

Data type	Est. tokens/row	Approx. rows in 128K context	Notes
Narrow numeric (bank transactions, 4–5 cols)	~50	~2,500	Best case — short values, few delimiters
Standard CRM / contact export (8–12 cols)	~100	~1,280	Email and phone fields raise per-row token cost
Wide operational export (13–20 cols)	~150	~850	Many columns × moderate text
Free-text / notes-heavy (any cols + long text)	~300–500+	~250–425	Long-text fields dominate; start much smaller
Pre-aggregated summary (GROUP BY result)	~30–50	~2,500–4,250	Uniform short values; highest token efficiency

These estimates use the same ~50–150 tokens/row heuristic documented in How Many Rows Can ChatGPT Handle? — and carry the same caveat: actual token usage varies significantly based on text length, encoding, and how the model interprets your data structure.

Verify Your Exact Token Count in 60 Seconds

The 50–150 tokens/row estimate is a heuristic. For datasets where the token budget matters — splitting decisions, fine-tuning dataset sizing, RAG ingestion planning — verify the actual token count of your data using OpenAI's public tokenizer.

Open a 5-row sample. From your CSV, copy the header row plus 5 data rows. Use rows that are representative of your data's typical content, not the shortest or longest values.
Paste into the OpenAI tokenizer at platform.openai.com/tokenizer. The tool displays the total token count for the pasted text plus a character count for comparison.
Calculate your dataset's actual tokens/row. Subtract the header row's contribution (typically 10–30 tokens; paste the header alone to measure it), then divide the remainder by 5, the number of data rows in your sample. Multiply the result by your total row count to estimate your dataset's full token budget.

A 5-row sample takes 30 seconds to gather and produces a far more accurate estimate than the heuristic table for your specific data shape. For files near the 128,000-token GPT-4o context window, this verification step is the difference between a successful single-prompt analysis and silent truncation.
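
The arithmetic in that last step can be sketched as follows. The function names and the sample figures (520 tokens for the pasted sample, 20 for the header) are hypothetical; substitute the counts the tokenizer actually reports and the number of data rows your sample contained.

```python
def tokens_per_row(sample_tokens: int, header_tokens: int, data_rows: int) -> float:
    """Average tokens per data row, with the header's one-time cost removed."""
    if data_rows <= 0:
        raise ValueError("data_rows must be positive")
    return (sample_tokens - header_tokens) / data_rows

def dataset_tokens(per_row: float, total_rows: int, header_tokens: int) -> int:
    """Full-file estimate: per-row cost times row count, plus the header once."""
    return round(per_row * total_rows) + header_tokens

# Illustrative numbers only: a sample of 5 data rows measured at 520 tokens in total.
per_row = tokens_per_row(sample_tokens=520, header_tokens=20, data_rows=5)
print(per_row, dataset_tokens(per_row, total_rows=5_000, header_tokens=20))
```

Removing the header before dividing matters most for small samples, where a 10–30 token header would otherwise inflate the per-row figure.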

How to Tell If Your File Will Fit

Estimating whether your file will exceed the context window takes four steps. The key insight: a file can pass the upload test (under 50MB) and still fail the context-window test, because the two limits are independent. Use the table above as a starting point and go conservative — it is faster to upload an extra chunk than to discover mid-analysis that ChatGPT silently dropped the last 30,000 rows.

Identify your data type from the table above and note the estimated tokens-per-row range for your column count and content.
Multiply tokens/row by your row count to get an estimated total token count. If your file has 5,000 rows of CRM data at ~100 tokens/row, that is approximately 500,000 tokens — well above the 128K context window.
Compare to both limits independently. The upload cap (~50MB) and the context-window cap (128,000 tokens) must both be satisfied. A file can fail one while passing the other — check file size first, then token estimate.
If over either limit, split by rows before uploading. A 1,000-row chunk of CRM data is approximately 100,000 tokens — within the 128K window and well under the upload cap. Splitting resolves both limits with the same action.
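
The four steps above can be turned into a small chunk-planning calculation. This is a sketch under stated assumptions: `plan_chunks` is a hypothetical helper, and the `headroom` factor of 0.8, which reserves part of the window for the prompt and the reply, is an arbitrary illustrative choice rather than a documented figure.

```python
import math

def plan_chunks(total_rows: int, tokens_per_row: float,
                window_tokens: int = 128_000, headroom: float = 0.8) -> tuple[int, int]:
    """Return (rows_per_chunk, chunk_count) so each chunk stays under the window."""
    budget = window_tokens * headroom              # tokens available for data rows
    rows_per_chunk = max(1, int(budget // tokens_per_row))
    return rows_per_chunk, math.ceil(total_rows / rows_per_chunk)

# 5,000 CRM rows at an estimated ~100 tokens/row.
print(plan_chunks(total_rows=5_000, tokens_per_row=100))
```

With `headroom=1.0` the calculation fills the whole window; going lower follows the post's advice to stay conservative.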

What to Do About It

If your file exceeds the upload cap, reduce its size before uploading — the most common approach is to split by rows or, for Excel workbooks, strip formatting overhead by exporting to CSV first. If it exceeds the context window, you have three options: split into smaller chunks, pre-aggregate the data before uploading, or reduce the column count to lower the per-row token cost. All three are most effective when done locally — before any file reaches a remote system.

Split by rows. Each chunk contains a clean header row and whole data rows, keeping every chunk under both the upload cap and the context-window limit. Use Split by Rows mode in CSV Splitter — each output file gets the original header and a contiguous block of complete rows, with no mid-row breaks at file boundaries. For the chunk-sizing table and multi-chunk workflow, see How to Split a Large CSV for ChatGPT Without Uploading It. For Excel workbooks, see Excel File Too Big for AI? Reduce It in Your Browser First.
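
For readers who prefer a script, here is a minimal sketch of a row-based split using only Python's standard `csv` module. It is a generic illustration, not the implementation behind CSV Splitter; the function name, output file naming, and UTF-8 encoding are assumptions.

```python
import csv
from pathlib import Path

def split_csv_by_rows(src: str, out_dir: str, rows_per_chunk: int) -> list[Path]:
    """Write chunk files that each start with the header and hold only whole rows."""
    if rows_per_chunk < 1:
        raise ValueError("rows_per_chunk must be at least 1")
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths: list[Path] = []

    def flush(header: list[str], rows: list[list[str]]) -> None:
        path = out / f"chunk_{len(paths) + 1:03d}.csv"
        with open(path, "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow(header)   # every chunk repeats the original header
            writer.writerows(rows)
        paths.append(path)

    with open(src, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)        # parses quoted fields, so records never break mid-row
        header = next(reader, None)
        if header is None:
            return paths              # empty file: nothing to split
        batch: list[list[str]] = []
        for row in reader:
            batch.append(row)
            if len(batch) == rows_per_chunk:
                flush(header, batch)
                batch = []
        if batch:
            flush(header, batch)      # final partial chunk
    return paths
```

Reading with `csv.reader` rather than splitting on newlines is the design choice that matters: a quoted cell containing a line break stays inside one record.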

Pre-aggregate first. If your goal is a summary — totals by category, averages by month, counts by status — aggregate the full dataset to a GROUP BY result before uploading. A 500,000-row order export reduces to a 200-row summary that fits any context window in a single ChatGPT session. For the pre-aggregation workflow, see Summarize a Huge CSV Before Feeding It to AI.
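
A minimal sketch of that kind of pre-aggregation, again using only the standard library, is shown below. The function name and the column names passed to it are hypothetical; it assumes the value column holds plain numeric text that `float()` can parse.

```python
import csv
from collections import defaultdict

def totals_by_category(src: str, key_col: str, value_col: str) -> dict[str, float]:
    """Sum value_col for each distinct key_col value, like a GROUP BY with SUM."""
    totals: dict[str, float] = defaultdict(float)
    with open(src, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):   # streams row by row; the file is never fully in memory
            totals[row[key_col]] += float(row[value_col])
    return dict(totals)
```

The result has one entry per category, so its size depends on the number of distinct values rather than on the row count of the original export.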

Reduce column count. If only 4 of your 15 columns are relevant to your question, drop the other 11 before uploading — each removed column reduces per-row token cost and lets more rows fit within the same context window. Removing personally identifiable columns before upload also limits data exposure regardless of which platform receives the file.
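
Dropping columns can be sketched in a few lines with `csv.DictReader` and `csv.DictWriter`. The helper name `keep_columns` is hypothetical, and the sketch assumes the first row of the source file is the header.

```python
import csv

def keep_columns(src: str, dst: str, columns: list[str]) -> None:
    """Copy src to dst, keeping only the named columns in the given order."""
    with open(src, newline="", encoding="utf-8") as fin, \
         open(dst, "w", newline="", encoding="utf-8") as fout:
        reader = csv.DictReader(fin)
        missing = [c for c in columns if c not in (reader.fieldnames or [])]
        if missing:
            raise KeyError(f"columns not found: {missing}")
        # extrasaction="ignore" silently drops every column not listed in fieldnames.
        writer = csv.DictWriter(fout, fieldnames=columns, extrasaction="ignore")
        writer.writeheader()
        for row in reader:
            writer.writerow(row)
```

Failing loudly on a misspelled column name is deliberate here: silently writing an empty column would hide the mistake until after the upload.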

The split or export happens on your device — the raw file stays in your environment during the process, and only the reduced output reaches any remote system.

Additional Resources

How this guide was built: Token estimates derived from OpenAI's published tokenizer documentation and the ~4 characters/token rule of thumb, May 2026. ChatGPT upload caps from the OpenAI File Uploads FAQ. GPT-4o context window from OpenAI's model documentation. All token figures labeled as estimates.

OpenAI: What Are Tokens and How to Count Them — OpenAI's published rule of thumb: ~4 characters per token, ~0.75 words per token; includes a tokenizer tool for exact counts on specific text.
OpenAI: File Uploads FAQ — The ~50MB spreadsheet upload cap, per-file hard cap (512MB), and upload frequency limits for ChatGPT Plus.
OpenAI: GPT-4o model — 128,000-token context window for GPT-4o; the processing cap that determines how much of your file the model can reason about in one turn.
MDN: Web Workers API — How browser-based workers run computation on your device without a server; the mechanism behind on-device file splitting.
RFC 4180: Common Format and MIME Type for CSV Files — The CSV structural standard; defines the header row and row-boundary rules relevant to how tokenizers parse structured data.
How Many Rows Can ChatGPT Handle? — The file-size math, rows-per-type estimates, and a full breakdown of ChatGPT's upload limits and failure modes.
How to Split a Large CSV for ChatGPT Without Uploading It — Step-by-step split workflow: chunk sizing, Split by Rows vs. Equal Parts, and the privacy case for splitting locally.
Excel File Too Big for AI? Reduce It in Your Browser First — Excel-specific reduction paths: strip formatting overhead, split by sheet, export to CSV or JSONL.

FAQ

What is a token in AI?

A token is the smallest unit of text a language model processes. OpenAI's documented rule of thumb is approximately 1 token per 4 characters of English text, or about 0.75 words per token. Tokenization uses a subword encoding algorithm: common words compress into single tokens, while numbers, dates, and structured identifiers often split into several. For practical purposes: a typical English word is 1–2 tokens; a 10-digit customer ID or a date string might be 4–6.

How many tokens is a row of CSV data?

Approximately 50–150 tokens for a typical structured row, but this is an estimate, not a tool-calculated figure. A narrow numeric row (4–5 columns of dates and amounts) is closer to 50 tokens; a wide CRM row with email addresses, phone numbers, and free-text notes can reach 150 or more. Token usage varies by column count, value length, and encoding. The only way to get an exact count for your specific data is to run a sample through a tokenizer tool.

Why does ChatGPT forget part of my spreadsheet after uploading?

If ChatGPT processes fewer rows than you uploaded, the cause is the context window, not the upload cap. The upload succeeded because the file was under 50MB, but the file's total token count exceeded the 128,000-token context window, and ChatGPT processed only the rows that fit. Always ask "how many rows are in this file?" immediately after upload; if the reported count is less than your actual row count, split the file into smaller chunks and re-upload each one.

What's the difference between the file size limit and the context window?

The file size limit, approximately 50MB for ChatGPT's Data Analysis tool, is a byte test that determines whether your file can be transferred to the platform at all; fail it and you get an upload error. The context window, 128,000 tokens for GPT-4o, is a content test that determines how much of the uploaded data the model can actively reason about in one turn; fail it and you may get no error, just a result based on fewer rows than you sent. Both limits are real, both can be exceeded independently, and a file can pass one while failing the other.

How many rows fit in GPT-4o's 128K context window?

It depends on your data's token density. As rough estimates: narrow numeric data (4–5 short columns) fits approximately 2,500 rows; a standard CRM export (8–12 columns) fits approximately 1,280 rows; a wide operational export (13–20 columns) fits approximately 850 rows; files with long free-text columns may fit only 250–425 rows. All figures use the ~50–150 tokens/row heuristic and should be treated as starting points; verify by checking the row count after each upload before running analysis.

Does splitting a CSV solve both the upload cap and the context-window problem?

Yes, the same action resolves both. Splitting by rows produces smaller files that pass the upload cap (under 50MB) and smaller token counts that fit within the 128K context window. The key is choosing a chunk size that satisfies both: use the tokens-per-row estimate from the table above as your ceiling and go conservative. For the full chunk-sizing table and multi-chunk workflow, see How to Split a Large CSV for ChatGPT Without Uploading It.

What happens if my file is under 50MB but still too large for the context window?

ChatGPT will typically accept the upload and begin analysis, but process only the portion of your data that fits within the 128,000-token context window, often without any error message. This is the silent truncation failure mode: ChatGPT may respond as if it analyzed everything while having seen only a fraction of your rows. The safest practice is to ask "how many rows are in this file?" immediately after upload and compare the reported count to your known row count before running any analysis.

Split First, Then Upload

Estimate tokens/row from your data type — narrow numeric rows run ~50 tokens; wide CRM exports with text fields run ~100–150
Split by rows if your total token estimate exceeds 128K — each chunk stays within the context window and under the 50MB upload cap
Verify the row count after each upload — ask "how many rows are in this file?" to catch silent truncation before it skews your analysis
Split locally so your full raw file never leaves your device — no intermediate server, no extra exposure before the AI sees it



