A token, roughly four characters of text, is the unit AI models use to read and process your input. When a spreadsheet is "too big" for an AI tool, the cause is one of two separate limits: the upload-size cap (a file-size test the platform runs before reading your data) or the context-window limit (a token-count test the model applies as it processes your data). These are independent constraints, and they are easy to conflate because both surface as the same complaint: the file is too big.
TL;DR: A token is approximately four characters; a typical structured CSV row is 50–150 tokens — always an estimate, never a precise figure. GPT-4o's 128,000-token context window controls how much of your spreadsheet the model can reason about in one turn, and it is a separate constraint from ChatGPT's ~50MB upload cap. If your file exceeds either limit, splitting by rows before uploading resolves both.
You upload a 50,000-row export to ChatGPT, the upload succeeds, and you ask it to summarize the data — but when you check the row count it processed, ChatGPT reports 12,000. The other 38,000 rows were inside the upload size cap but outside the model's 128,000-token context window: a limit that has nothing to do with file size. That gap between "will it upload" and "can it actually read all of it" is what this post explains.
Table of Contents
- What Is a Token?
- Why Spreadsheets Are Token-Heavy
- The Two Reasons Your File Is 'Too Big'
- How Many Rows Fit in the 128K Context Window?
- Verify Your Exact Token Count in 60 Seconds
- How to Tell If Your File Will Fit
- What to Do About It
- Additional Resources
- FAQ
What Is a Token?
A token is the smallest unit of text an AI language model processes when it reads input. OpenAI's published rule of thumb is approximately 1 token per 4 characters of English text, or roughly 0.75 words per token. The word "spreadsheet" is about 2 tokens; the phrase "customer ID" is about 2 or 3 depending on how the tokenizer splits it; the date "2026-01-15" is closer to 5 or 6, because digit groups and hyphens are split into separate pieces.
Tokenization is not word-splitting. Models use a subword encoding algorithm that compresses common words into single tokens and splits uncommon sequences (structured identifiers, dates, and numeric codes) into multiple pieces. "January" is typically one token; "01/15/2026" in a date column might tokenize into five or six, because the exact character sequence is not a unit the tokenizer has learned to compress efficiently.
The practical implication: every character in your CSV costs tokens, not just the readable words. Cell values, comma delimiters, quotation marks around fields that contain commas, and the newline at the end of each row all count — which is why structured data behaves very differently from prose text of the same character count.
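As a rough illustration of "every character counts", the sketch below applies the 4-characters-per-token rule of thumb to a raw CSV line, delimiters and newline included. The function name and the sample row are made up for illustration, and the result is only a heuristic: structured data often tokenizes into more pieces than this rule predicts.

```python
import math

# OpenAI's rule of thumb for English prose; structured data often runs higher.
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str) -> int:
    """Rough token estimate from character count.

    Counts every character: cell values, commas, quotes, and newlines.
    """
    return math.ceil(len(text) / CHARS_PER_TOKEN)


# Hypothetical CSV row, including the quoted field and the trailing newline.
row = 'CUST-00892471,"Smith, Jane",2026-01-15,149.99\n'
print(len(row), "characters ->", estimate_tokens(row), "tokens (heuristic)")
```

Treat the output as a floor rather than a measurement; an actual tokenizer run on the same row will usually report a higher count.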
Why Spreadsheets Are Token-Heavy
Spreadsheets use more tokens per character than prose text because their structure — repeated column patterns, numeric identifiers, dates, and category strings — resists the compression that makes common English words token-efficient. Every cell value, every comma delimiter, every quoted field boundary, and every row-ending newline in a CSV file consumes tokens. That overhead repeats identically across every row, and it compounds at scale: what looks like a moderate file size can represent an enormous token budget.
The column headers are the one efficient part: written once and amortized across all rows. Everything else scales linearly. A customer ID like "CUST-00892471" might tokenize into five or six tokens on its own — multiply that by your row count and a single column's contribution to your token budget becomes significant.
For context: a 5,000-word article (roughly 35,000 characters) is approximately 8,750 tokens under the 4-characters-per-token rule. A 5,000-row CSV with 8 columns and mixed data might be 400,000–750,000 tokens, which works out to roughly 80–150 tokens per row. That is far more in total, and typically more per character as well, than the prose article. Per-row cost depends on what each cell contains, which is why the same row count can represent very different token totals.
The Two Reasons Your File Is 'Too Big'
A spreadsheet is "too big" for AI for one of two distinct reasons, and they operate at different stages. The first is the upload-size cap: the file must be small enough in bytes to transfer to the platform before any processing begins. The second is the context-window limit: even after a successful upload, the model can only actively reason about as many tokens as fit in its context window — 128,000 for GPT-4o. The upload cap is a file-size test; the context window is a content test; both can fail independently of each other.
The Upload-Size Cap
The upload cap is a byte limit — it applies before the AI reads a single character of your data. ChatGPT's Data Analysis tool caps CSV and spreadsheet uploads at approximately 50MB per file. A file that exceeds this is rejected at the upload step, before the model is involved at all.
How many rows 50MB holds depends entirely on column count and cell length — a narrow numeric export can fit 300,000+ rows in 50MB, while a wide CRM export with long text fields might cap out at 30,000. For the full file-size math and a rows-per-type breakdown, see How Many Rows Can ChatGPT Handle?.
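The relationship between row width and row capacity can be sketched with simple division. The code below is a minimal sketch that assumes the ~50MB cap means 50 × 1024 × 1024 bytes (the platform's exact threshold may differ); the function names are illustrative and the byte widths in the usage lines are hypothetical.

```python
# Assumption: "~50MB" is treated as 50 MiB; the real cutoff may differ.
UPLOAD_CAP_BYTES = 50 * 1024 * 1024


def average_row_bytes(sample_lines: list[str]) -> float:
    """Average UTF-8 size of a few representative raw CSV lines."""
    if not sample_lines:
        raise ValueError("need at least one sample row")
    total = sum(len(line.encode("utf-8")) for line in sample_lines)
    return total / len(sample_lines)


def rows_under_cap(avg_row_bytes: float, cap_bytes: int = UPLOAD_CAP_BYTES) -> int:
    """Approximate number of rows that fit under the byte cap."""
    if avg_row_bytes <= 0:
        raise ValueError("average row size must be positive")
    return int(cap_bytes // avg_row_bytes)


# Hypothetical widths: a narrow numeric row vs. a wide CRM row.
print(rows_under_cap(170))    # narrow export, on the order of 300,000 rows
print(rows_under_cap(1750))   # wide export, on the order of 30,000 rows
```

Measuring the average from a sample of your own file is more reliable than guessing a width, since a few long text fields can shift the result considerably.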
The Context-Window Limit
The context window is a token limit — it controls how much of your data the model can hold in active memory and reason about in a single conversation turn. GPT-4o's context window is 128,000 tokens. A file can upload successfully and still exceed the context window if its tokenized content exceeds that cap.
This is the limit behind "silent truncation": ChatGPT accepts the file, begins analysis, but processes only the rows that fit within the 128K window — and often responds as if it analyzed everything. The upload cap produces an error; the context window often produces no error at all, just a result based on fewer rows than you sent.
The two limits are independent. A 10MB file that is token-dense, such as a CSV with long free-text notes in every row, can upload successfully but overflow the context window. In the other direction, a 60MB Excel workbook whose size is mostly formatting overhead fails the upload cap, even though the data inside might fit the context window once exported to a much smaller CSV.
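Because the two tests are independent, it can help to evaluate them separately rather than as one "too big" check. The following is a minimal sketch; the `LimitCheck` type, the function name, and the default thresholds (50 MiB and 128,000 tokens, taken from the figures in this post) are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class LimitCheck:
    upload_ok: bool    # passes the byte-size test
    context_ok: bool   # passes the token-count test


def check_limits(
    file_bytes: int,
    estimated_tokens: int,
    upload_cap_bytes: int = 50 * 1024 * 1024,  # assumed reading of "~50MB"
    context_tokens: int = 128_000,             # GPT-4o window cited above
) -> LimitCheck:
    """Evaluate each limit on its own; neither result implies the other."""
    return LimitCheck(
        upload_ok=file_bytes <= upload_cap_bytes,
        context_ok=estimated_tokens <= context_tokens,
    )


# Small in bytes but far over the token budget: uploads, then overflows.
print(check_limits(file_bytes=10 * 1024 * 1024, estimated_tokens=2_000_000))
# Large in bytes but small in token terms: rejected at upload.
print(check_limits(file_bytes=60 * 1024 * 1024, estimated_tokens=80_000))
```

The token figures passed in here are hypothetical estimates; the byte size comes from the file system, while the token estimate has to come from a heuristic or a tokenizer run.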
How Many Rows Fit in the 128K Context Window?
The number of rows that fit in GPT-4o's 128,000-token context window depends entirely on your data's token density — column count, value length, and data type. The table below maps common data shapes to an estimated tokens-per-row and the approximate row count that fills the 128K window. All figures are estimates based on the ~50–150 tokens/row heuristic; actual token usage varies by encoding and content, and the right approach is to test with a smaller chunk and verify ChatGPT's reported row count before scaling up.
| Data type | Est. tokens/row | Approx. rows in 128K context | Notes |
|---|---|---|---|
| Narrow numeric (bank transactions, 4–5 cols) | ~50 | ~2,500 | Best case — short values, few delimiters |
| Standard CRM / contact export (8–12 cols) | ~100 | ~1,280 | Email and phone fields raise per-row token cost |
| Wide operational export (13–20 cols) | ~150 | ~850 | Many columns × moderate text |
| Free-text / notes-heavy (any cols + long text) | ~300–500+ | ~250–425 | Long-text fields dominate; start much smaller |
| Pre-aggregated summary (GROUP BY result) | ~30–50 | ~2,500–4,250 | Uniform short values; highest token efficiency |
These estimates use the same ~50–150 tokens/row heuristic documented in How Many Rows Can ChatGPT Handle? — and carry the same caveat: actual token usage varies significantly based on text length, encoding, and how the model interprets your data structure.
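The row counts in the table come from dividing the window by the per-row estimate. The sketch below reproduces that arithmetic; the `reserved_tokens` parameter is an added assumption for anyone who wants to hold back part of the window for the prompt and the model's reply, and is not part of the table's figures.

```python
CONTEXT_TOKENS = 128_000  # GPT-4o context window cited in this post


def rows_that_fit(
    tokens_per_row: int,
    reserved_tokens: int = 0,          # optional headroom (assumption)
    context_tokens: int = CONTEXT_TOKENS,
) -> int:
    """Whole rows that fit in the window at a given per-row estimate."""
    if tokens_per_row <= 0:
        raise ValueError("tokens_per_row must be positive")
    return max(0, (context_tokens - reserved_tokens) // tokens_per_row)


for label, per_row in [
    ("narrow numeric", 50),
    ("standard CRM", 100),
    ("wide operational", 150),
    ("free-text, low end", 300),
    ("free-text, high end", 500),
]:
    print(f"{label}: ~{rows_that_fit(per_row)} rows")
```

The table rounds these results down slightly, which is a reasonable habit given that the per-row figures are themselves estimates.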
Verify Your Exact Token Count in 60 Seconds
The 50–150 tokens/row estimate is a heuristic. For datasets where the token budget matters — splitting decisions, fine-tuning dataset sizing, RAG ingestion planning — verify the actual token count of your data using OpenAI's public tokenizer.
- Open a 5-row sample. From your CSV, copy 5 rows including the header row (the header plus 4 data rows). Use rows that are representative of your data's typical content, not the shortest or longest values.
- Paste into the OpenAI tokenizer at platform.openai.com/tokenizer. The tool displays the total token count for the pasted text plus a character count for comparison.
- Calculate your dataset's actual tokens/row. Subtract the header-row contribution (typically 10–30 tokens), then divide the remainder by the number of data rows in the sample (4 if the header was one of your 5 rows). Multiply by your total row count to estimate your dataset's full token budget. For a quick check, dividing the full count by 5 is close enough.
A 5-row sample takes 30 seconds to gather and produces a far more accurate estimate than the heuristic table for your specific data shape. For files near the 128,000-token GPT-4o context window, this verification step is the difference between a successful single-prompt analysis and silent truncation.
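The sample arithmetic can be written down as a small sketch. It treats the header separately from the data rows, so the header's tokens are counted once rather than scaled with the row count. The function names and the sample numbers (420 tokens, a 20-token header, 4 data rows) are hypothetical.

```python
def tokens_per_data_row(sample_tokens: int, header_tokens: int, data_rows: int) -> float:
    """Average tokens per data row in the pasted sample, header excluded."""
    if data_rows <= 0:
        raise ValueError("sample must contain at least one data row")
    return (sample_tokens - header_tokens) / data_rows


def estimate_total_tokens(
    sample_tokens: int, header_tokens: int, data_rows: int, total_rows: int
) -> int:
    """Scale the per-row average to the full file; the header is counted once."""
    per_row = tokens_per_data_row(sample_tokens, header_tokens, data_rows)
    return round(header_tokens + per_row * total_rows)


# Hypothetical tokenizer reading: 420 tokens for a header plus 4 data rows.
print(tokens_per_data_row(420, 20, 4))            # tokens per data row
print(estimate_total_tokens(420, 20, 4, 5000))    # projected total for 5,000 rows
```

A larger or more varied sample narrows the error, particularly when some rows carry long text and others do not.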
How to Tell If Your File Will Fit
Estimating whether your file will exceed the context window takes four steps. The key insight: a file can pass the upload test (under 50MB) and still fail the context-window test, because the two limits are independent. Use the table above as a starting point and go conservative — it is faster to upload an extra chunk than to discover mid-analysis that ChatGPT silently dropped the last 30,000 rows.
- Identify your data type from the table above and note the estimated tokens-per-row range for your column count and content.
- Multiply tokens/row by your row count to get an estimated total token count. If your file has 5,000 rows of CRM data at ~100 tokens/row, that is approximately 500,000 tokens — well above the 128K context window.
- Compare to both limits independently. The upload cap (~50MB) and the context-window cap (128,000 tokens) must both be satisfied. A file can fail one while passing the other — check file size first, then token estimate.
- If over either limit, split by rows before uploading. A 1,000-row chunk of CRM data is approximately 100,000 tokens — within the 128K window and well under the upload cap. Splitting resolves both limits with the same action.
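The four steps above can be turned into a chunk plan. This is a minimal sketch: the function name is illustrative, and the 80% safety margin is an assumption chosen to leave room for the prompt and response (it lands close to the 1,000-row CRM chunk mentioned above).

```python
import math


def plan_chunks(
    total_rows: int,
    tokens_per_row: int,
    context_tokens: int = 128_000,  # GPT-4o window cited in this post
    safety_pct: int = 80,           # assumption: use only 80% of the window
) -> tuple[int, int]:
    """Return (rows_per_chunk, chunk_count) for a given per-row estimate."""
    if total_rows < 0 or tokens_per_row <= 0:
        raise ValueError("total_rows must be >= 0 and tokens_per_row > 0")
    budget = context_tokens * safety_pct // 100
    rows_per_chunk = max(1, budget // tokens_per_row)
    chunk_count = math.ceil(total_rows / rows_per_chunk)
    return rows_per_chunk, chunk_count


# 5,000 rows of CRM data at ~100 tokens/row.
rows_per_chunk, chunk_count = plan_chunks(5000, 100)
print(rows_per_chunk, "rows per chunk,", chunk_count, "chunks")
```

Lowering `safety_pct` produces more, smaller chunks, which costs a few extra uploads but reduces the chance of truncation when the per-row estimate is optimistic.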
What to Do About It
If your file exceeds the upload cap, reduce its size before uploading — the most common approach is to split by rows or, for Excel workbooks, strip formatting overhead by exporting to CSV first. If it exceeds the context window, you have three options: split into smaller chunks, pre-aggregate the data before uploading, or reduce the column count to lower the per-row token cost. All three are most effective when done locally — before any file reaches a remote system.
Split by rows. Each chunk contains a clean header row and whole data rows, keeping every chunk under both the upload cap and the context-window limit. Use Split by Rows mode in CSV Splitter — each output file gets the original header and a contiguous block of complete rows, with no mid-row breaks at file boundaries. For the chunk-sizing table and multi-chunk workflow, see How to Split a Large CSV for ChatGPT Without Uploading It. For Excel workbooks, see Excel File Too Big for AI? Reduce It in Your Browser First.
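For readers who prefer a script, the same idea can be sketched with Python's standard `csv` module. This is not how CSV Splitter is implemented; it is an illustrative local alternative, and the function names and the `_partNNN` file naming are assumptions. Parsing with `csv.reader` rather than splitting on raw newlines keeps quoted fields that contain line breaks inside one row.

```python
import csv
from pathlib import Path


def _write_chunk(out_dir: Path, stem: str, index: int, header: list, rows: list) -> Path:
    """Write one chunk with the original header on top."""
    path = out_dir / f"{stem}_part{index:03d}.csv"
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)
    return path


def split_csv_by_rows(src: Path, out_dir: Path, rows_per_chunk: int) -> list:
    """Split src into files of at most rows_per_chunk data rows each."""
    if rows_per_chunk <= 0:
        raise ValueError("rows_per_chunk must be positive")
    out_dir.mkdir(parents=True, exist_ok=True)
    outputs = []
    with src.open(newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader, None)
        if header is None:  # empty file: nothing to split
            return outputs
        chunk, index = [], 1
        for row in reader:
            chunk.append(row)
            if len(chunk) == rows_per_chunk:
                outputs.append(_write_chunk(out_dir, src.stem, index, header, chunk))
                chunk, index = [], index + 1
        if chunk:  # final partial chunk
            outputs.append(_write_chunk(out_dir, src.stem, index, header, chunk))
    return outputs
```

Calling `split_csv_by_rows(Path("export.csv"), Path("chunks"), 1000)` on a hypothetical `export.csv` would write `chunks/export_part001.csv`, `chunks/export_part002.csv`, and so on, entirely on the local machine.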
Pre-aggregate first. If your goal is a summary — totals by category, averages by month, counts by status — aggregate the full dataset to a GROUP BY result before uploading. A 500,000-row order export reduces to a 200-row summary that fits any context window in a single ChatGPT session. For the pre-aggregation workflow, see Summarize a Huge CSV Before Feeding It to AI.
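A GROUP BY style reduction does not require a database. The sketch below shows one way to do it with the standard library, streaming rows so the full file never has to sit in memory. The column names `status` and `amount` and the three sample rows are hypothetical.

```python
import csv
import io
from collections import defaultdict
from typing import Iterable


def aggregate(rows: Iterable[dict], group_col: str, value_col: str) -> list:
    """Count rows and sum value_col for each distinct value of group_col."""
    buckets = defaultdict(lambda: {"count": 0, "total": 0.0})
    for row in rows:
        bucket = buckets[row[group_col]]
        bucket["count"] += 1
        bucket["total"] += float(row[value_col])
    return [
        {group_col: key, "count": b["count"], "total": b["total"]}
        for key, b in sorted(buckets.items())
    ]


# Tiny in-memory stand-in for a large export; a real run would open the file.
sample = io.StringIO("status,amount\npaid,10.5\nopen,4\npaid,2.5\n")
summary = aggregate(csv.DictReader(sample), "status", "amount")
for line in summary:
    print(line)
```

The output has one row per group regardless of how many input rows there were, which is what keeps the uploaded summary small.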
Reduce column count. If only 4 of your 15 columns are relevant to your question, drop the other 11 before uploading — each removed column reduces per-row token cost and lets more rows fit within the same context window. Removing personally identifiable columns before upload also limits data exposure regardless of which platform receives the file.
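Dropping columns can also be done locally before anything is uploaded. The following is a minimal sketch using `csv.DictReader` and `csv.DictWriter`; the function name is illustrative, and the column names in the usage lines (`id`, `email`, `amount`) are hypothetical.

```python
import csv
import io
from typing import TextIO


def keep_columns(src: TextIO, dst: TextIO, keep: list) -> int:
    """Copy only the named columns from src to dst; return data rows written."""
    reader = csv.DictReader(src)
    missing = [c for c in keep if c not in (reader.fieldnames or [])]
    if missing:
        raise KeyError(f"columns not found in header: {missing}")
    # extrasaction="ignore" drops every column that is not listed in keep.
    writer = csv.DictWriter(dst, fieldnames=keep, extrasaction="ignore")
    writer.writeheader()
    written = 0
    for row in reader:
        writer.writerow(row)
        written += 1
    return written


# Hypothetical export: drop the email column before upload.
source = io.StringIO("id,email,amount\n1,a@example.com,9.99\n2,b@example.com,5\n")
reduced = io.StringIO()
print(keep_columns(source, reduced, ["id", "amount"]), "rows written")
print(reduced.getvalue())
```

Listing the columns to keep, rather than the columns to drop, means a newly added sensitive column in a future export is excluded by default.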
The split or export happens on your device — the raw file stays in your environment during the process, and only the reduced output reaches any remote system.
Additional Resources
How this guide was built: Token estimates derived from OpenAI's published tokenizer documentation and the ~4 characters/token rule of thumb, May 2026. ChatGPT upload caps from the OpenAI File Uploads FAQ. GPT-4o context window from OpenAI's model documentation. All token figures labeled as estimates.
- OpenAI: What Are Tokens and How to Count Them — OpenAI's published rule of thumb: ~4 characters per token, ~0.75 words per token; includes a tokenizer tool for exact counts on specific text.
- OpenAI: File Uploads FAQ — The ~50MB spreadsheet upload cap, per-file hard cap (512MB), and upload frequency limits for ChatGPT Plus.
- OpenAI: GPT-4o model — 128,000-token context window for GPT-4o; the processing cap that determines how much of your file the model can reason about in one turn.
- MDN: Web Workers API — How browser-based workers run computation on your device without a server; the mechanism behind on-device file splitting.
- RFC 4180: Common Format and MIME Type for CSV Files — The CSV structural standard; defines the header row and row-boundary rules relevant to how tokenizers parse structured data.
- How Many Rows Can ChatGPT Handle? — The file-size math, rows-per-type estimates, and a full breakdown of ChatGPT's upload limits and failure modes.
- How to Split a Large CSV for ChatGPT Without Uploading It — Step-by-step split workflow: chunk sizing, Split by Rows vs. Equal Parts, and the privacy case for splitting locally.
- Excel File Too Big for AI? Reduce It in Your Browser First — Excel-specific reduction paths: strip formatting overhead, split by sheet, export to CSV or JSONL.
FAQ
Split First, Then Upload
- Estimate tokens/row from your data type: narrow numeric rows run ~50 tokens; wide CRM exports with text fields run ~100–150.
- Split by rows if your total token estimate exceeds 128K, so each chunk stays within the context window and under the 50MB upload cap.
- Verify the row count after each upload: ask "how many rows are in this file?" to catch silent truncation before it skews your analysis.
- Split locally so your full raw file never leaves your device: no intermediate server, no extra exposure before the AI sees it.