AI-Ready Data Checklist: 10 Things to Verify Before Upload (2026)

May 28, 2026
By SplitForge Team

Before you upload a spreadsheet to ChatGPT, Claude, or Gemini, or send a file through an AI fine-tuning or batch API, verify these 10 items. The checks can run in any order, and the problems they catch are among the most common reasons AI returns wrong answers, silently drops rows, or rejects the file at upload.

TL;DR: The 10-item pre-flight covers encoding, headers, nulls, PII, date formats, output format, file size, aggregation or split strategy, sample testing, and prep documentation. Passing all 10 before upload removes the most common causes of AI misinterpretation, silent truncation, and API rejection. Skipping item 4 (PII) before uploading to a consumer AI plan that trains on data by default is the highest-risk omission.



An analyst uploads a 30,000-row sales export to ChatGPT and asks for a monthly revenue breakdown. The output looks confident and well-formatted — until someone spots that 12% of rows have a null customer ID and the date column mixes MM/DD/YYYY with DD/MM/YYYY. ChatGPT hallucinated dates for the ambiguous rows and silently skipped the nulls without any error message. A five-minute pre-flight would have caught both problems before the analysis reached the team.


Methodology: Checklist validated against the AI-prep workflow across SplitForge tools (Data Cleaner, Format & Encoding Checker, Data Masking, Excel to JSON Converter, CSV Splitter, Aggregate & Group, Data Profiler), May 2026.


Why a Checklist Beats "Just Try It"

AI tools fail silently on bad input in ways that are difficult to detect after the fact. A ChatGPT response built on a file with mixed date formats looks like a real analysis — the model produces coherent sentences with plausible numbers — but the underlying calculation is partially wrong because the AI interpreted ambiguous dates as the wrong month. There is no error message, no flagged row, no indication that 30% of the date column was misread. The analyst sees a confident answer and moves on.

A pre-flight checklist is fast because it runs before the analysis starts, when fixing problems is cheap. Checking encoding takes seconds; re-encoding a file after discovering that accented characters corrupted a name column — and then re-running the analysis — takes much longer. The same applies to PII: masking a column before upload takes minutes; containing a data exposure after the fact is a different problem category entirely.

The 10 items below are not a comprehensive data-cleaning protocol. They are the minimum verification set that catches the categories of error AI tools most commonly fail on: structural problems (encoding, headers), data quality problems (nulls, dates), privacy exposure (PII), format mismatch (CSV vs JSON vs JSONL), and size violations (upload cap, context window). Each item has a verification action and a routing path — no item is "review your data carefully."


The 10-Item AI-Ready Pre-Flight

1. Encoding is UTF-8 (no BOM)

AI tokenizers are built around Unicode. When an ANSI-encoded file (a Windows code page such as Windows-1252) is read as UTF-8, characters outside the basic ASCII range are substituted or dropped: accented letters, non-Latin scripts, and currency symbols all break silently. A file that displays correctly in Excel can arrive at ChatGPT or Claude with corrupted characters if it was saved in a Windows code page.

Verification: open the file in VS Code and check the encoding label in the bottom status bar, or run it through Format & Encoding Checker for a confirmed encoding report. If non-UTF-8: re-save as CSV UTF-8 from Excel (File → Save As → CSV UTF-8). UTF-8 with BOM is accepted by most AI tools; UTF-8 without BOM is preferred for API ingestion.
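
For a scripted version of the same check, a minimal sketch in Python follows. The function names check_utf8 and strip_bom are illustrative and not part of any SplitForge tool. A strict UTF-8 decode succeeds on plain ASCII as well, so a pass here means "safe to treat as UTF-8," not "the file contained non-ASCII characters."

```python
import codecs

def check_utf8(raw: bytes) -> dict:
    """Report whether the bytes decode as strict UTF-8 and whether a BOM leads the file."""
    has_bom = raw.startswith(codecs.BOM_UTF8)
    try:
        raw.decode("utf-8")  # strict mode: raises on any invalid byte sequence
        is_utf8 = True
    except UnicodeDecodeError:
        is_utf8 = False
    return {"is_utf8": is_utf8, "has_bom": has_bom}

def strip_bom(raw: bytes) -> bytes:
    """Return the bytes without a leading UTF-8 BOM, if one is present."""
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):]
    return raw
```

To use it on a file, read the file in binary mode (open(path, "rb").read()) and pass the bytes in. A Windows-1252 file containing accented characters will usually fail the strict decode, which is the signal to re-save as UTF-8.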


2. Column headers are clean and descriptive

AI models use column header text as semantic labels — the headers tell the model what each column represents. "Column1," "Unnamed: 0," "F1," and headers with leading or trailing spaces all degrade the AI's ability to reference and reason about the correct fields; the model may guess the column purpose from values alone, which is unreliable for short or sparse columns.

Verification: scan the first row for unnamed, numbered, duplicate, or whitespace-padded headers. Data Cleaner flags unnamed and duplicate headers and provides a rename interface. "CustomerEmail" beats "Email2"; "OrderDate" beats "Date."
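
The header scan can also be sketched in a few lines of Python. The placeholder pattern below is an assumption covering the examples named above ("Column1," "Unnamed: 0," "F1"), and audit_headers is a hypothetical helper, not a Data Cleaner API.

```python
import re
from collections import Counter

# Auto-generated placeholder names; an illustrative list, extend as needed.
PLACEHOLDER = re.compile(r"^(column\d+|unnamed: \d+|f\d+|field\d+)$", re.IGNORECASE)

def audit_headers(headers: list[str]) -> dict[str, list[str]]:
    """Group headers by the problem they show: empty, padded, placeholder, duplicate."""
    # Duplicates are compared after trimming and lowercasing.
    counts = Counter(h.strip().lower() for h in headers)
    issues = {"empty": [], "padded": [], "placeholder": [], "duplicate": []}
    for h in headers:
        stripped = h.strip()
        if not stripped:
            issues["empty"].append(h)
            continue
        if h != stripped:
            issues["padded"].append(h)
        if PLACEHOLDER.match(stripped):
            issues["placeholder"].append(h)
        if counts[stripped.lower()] > 1:
            issues["duplicate"].append(h)
    return issues
```

Pass in the first row of the file as a list of strings. Any non-empty list in the result is a header to rename before upload.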


3. No empty cells in critical columns

When the AI encounters a null in a column the question depends on, it typically does one of three things: hallucinates a plausible value, skips the row silently, or conflates the row with an adjacent one. None of these produce an error — they produce a plausible-looking wrong answer, often indistinguishable from correct output.

Verification: a profiling pass generates null counts per column (see Profile Your Data: Generate Statistics from 5M Rows). Focus on the columns that are central to the question — the join key, the date, the metric. Decide whether to fill, drop, or flag those nulls before uploading; any of the three is better than letting the AI decide.
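
When a full profiling pass is more than the question needs, a null count for the critical columns can be sketched with the standard library alone. The set of tokens treated as null is an assumption; exports differ, so adjust it to the source system. null_counts is a hypothetical helper name.

```python
import csv
import io

# Strings treated as "null" in this sketch; an assumption, not a standard.
NULL_TOKENS = {"", "null", "none", "n/a", "na", "nan"}

def null_counts(csv_text: str, critical: list[str]) -> dict[str, int]:
    """Count null-like values in each critical column of a CSV string."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = {col: 0 for col in critical}
    for row in reader:
        for col in critical:
            value = row.get(col)  # None if the column or the field is missing
            if value is None or value.strip().lower() in NULL_TOKENS:
                counts[col] += 1
    return counts
```

Run it on the join key, the date, and the metric column. A non-zero count is the prompt to decide between filling, dropping, and flagging.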


4. PII removed or masked

Consumer AI plans train on uploaded data by default unless you opt out in account settings: ChatGPT Free and Plus, Claude Free, Pro, and Max on claude.ai, and Gemini consumer plans all include a training-consent default in their terms of service. For files containing customer names, email addresses, account numbers, Social Security numbers, or any other personal identifiers, uploading without masking means those records potentially enter training datasets.

Verification: scan column names for PII-bearing fields before upload. For the full masking workflow — 8 techniques including substitution, hashing, pseudonymization, and partial redaction; 50+ PII pattern library; local browser processing — see How to Remove PII From a CSV Before Using AI. Commercial API tiers (OpenAI API, Anthropic API, Google Cloud AI API) are explicitly exempt from training-by-default under their commercial terms; the risk is specific to consumer-plan uploads via the chat interface.
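
As a rough first pass before the full masking workflow, the column-name scan can be sketched as below. This is a deliberately small illustration: the header hints and the two value patterns (email, US Social Security number) are assumptions, they over-flag by design, and they are no substitute for a full PII pattern library. flag_pii_columns is a hypothetical name.

```python
import re

# Substrings in a header that suggest personal data; illustrative, not exhaustive.
NAME_HINTS = ("name", "email", "phone", "ssn", "address", "dob", "birth", "account")

# Two simple value patterns; real PII detection needs many more.
VALUE_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def flag_pii_columns(rows: list[dict[str, str]]) -> dict[str, set[str]]:
    """Map each suspicious column to the reasons it was flagged."""
    flags: dict[str, set[str]] = {}
    for row in rows:
        for col, value in row.items():
            if any(hint in col.lower() for hint in NAME_HINTS):
                flags.setdefault(col, set()).add("header")
            for label, pattern in VALUE_PATTERNS.items():
                if pattern.search(value or ""):
                    flags.setdefault(col, set()).add(label)
    return flags
```

Feed it the rows from csv.DictReader. A column flagged only by value (for example, a free-text notes field containing an SSN) is the case a header-only scan misses.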


5. Date formats are consistent

A column that mixes MM/DD/YYYY with DD/MM/YYYY — a common artifact of multi-region exports or merged CRM pulls — confuses AI date interpretation systematically. For ambiguous dates like "04/05/2026," the AI makes a consistent assumption (one format throughout) that may be wrong for a significant fraction of rows, and it will not flag the ambiguity.

Verification: spot-check date columns for values where one of the first two fields exceeds 12; the field above 12 must be the day, so those values reveal which format a row uses. If both orderings appear in the same column, the formats are mixed: standardize to ISO 8601 (YYYY-MM-DD) with Data Cleaner before uploading. ISO 8601 is unambiguous and parses consistently across AI tools and APIs.
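
The spot-check can be automated with a small tally, sketched here under the assumption that dates use slash separators and four-digit years. classify_slash_dates is a hypothetical helper. It does not validate calendar dates, it only sorts values by which field can be the day.

```python
import re

# Matches values like 4/5/2026 or 04/05/2026.
SLASH_DATE = re.compile(r"^(\d{1,2})/(\d{1,2})/(\d{4})$")

def classify_slash_dates(values: list[str]) -> dict[str, int]:
    """Tally slash-separated dates by which field must be the day."""
    tally = {"month_first": 0, "day_first": 0, "ambiguous": 0, "other": 0}
    for value in values:
        match = SLASH_DATE.match(value.strip())
        if not match:
            tally["other"] += 1  # not a slash date (for example, already ISO 8601)
            continue
        first, second = int(match.group(1)), int(match.group(2))
        if first > 12 and second <= 12:
            tally["day_first"] += 1    # DD/MM/YYYY
        elif second > 12 and first <= 12:
            tally["month_first"] += 1  # MM/DD/YYYY
        elif first <= 12 and second <= 12:
            tally["ambiguous"] += 1    # could be either ordering
        else:
            tally["other"] += 1        # both fields above 12: invalid either way
    return tally
```

Non-zero counts in both month_first and day_first mean the column is mixed. In that case the ambiguous rows cannot be resolved from the values alone and need to be traced back to their source export.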


6. Format matches the AI use case

CSV is correct for ChatGPT's Data Analysis tool and Claude's document upload. JSON arrays work for pasting structured data directly into a prompt as context. JSONL is required by OpenAI fine-tuning, the OpenAI batch API, and Anthropic's message batches — submitting a JSON array to a fine-tuning endpoint produces a schema rejection error.

For the full format-decision frame, see Best Format for Feeding Data Into ChatGPT or Claude. For Excel-to-JSONL conversion specifically — including the JSON shape required for each AI use case — see Convert Excel to JSON for AI APIs and LLM Pipelines.
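
The difference between a JSON array and JSONL is easy to see side by side. The sketch below converts the same CSV text both ways using only the standard library. It shows the container format only: the object shape each fine-tuning or batch endpoint expects on every line is endpoint-specific and is not reproduced here. Both function names are hypothetical.

```python
import csv
import io
import json

def csv_to_json_array(csv_text: str) -> str:
    """One JSON document: a single array holding every row as an object."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return json.dumps(rows, ensure_ascii=False)

def csv_to_jsonl(csv_text: str) -> str:
    """JSONL: one complete JSON object per line, with no enclosing array."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return "\n".join(json.dumps(row, ensure_ascii=False) for row in reader)
```

All values come out as strings because CSV carries no types; cast numeric columns before serializing if the target schema requires numbers.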


7. File size is within the target's limits

Two independent limits apply: the upload-size cap and the context-window limit. ChatGPT's Data Analysis tool caps spreadsheet uploads at approximately 50MB; Claude accepts up to 500MB per CSV file. The context-window limit — 128,000 tokens for GPT-4o — constrains how much of the uploaded data the model can reason about in one turn, regardless of whether the file uploaded successfully.

For the upload-cap math and rows-per-data-type estimates, see How Many Rows Can ChatGPT Handle?. For the token math and the two-limit framework, see What's a Token? Why Your Spreadsheet Is 'Too Big' for AI. Both limits apply independently — a file can pass the upload cap and still overflow the context window.
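
The two-limit check reduces to simple arithmetic. The sketch below uses the figures quoted in this article (an upload cap of roughly 50MB, a 128,000-token context window, and the ~50–150 tokens/row heuristic); all three are defaults to override for other tools or models, and the cap is approximated as 50,000,000 bytes. fits_limits is a hypothetical helper.

```python
def fits_limits(
    size_bytes: int,
    row_count: int,
    upload_cap_bytes: int = 50_000_000,   # ~50MB, approximate
    context_tokens: int = 128_000,
    tokens_per_row: tuple[int, int] = (50, 150),  # rough heuristic range
) -> dict:
    """Check a file against the upload cap and the context window independently."""
    low, high = tokens_per_row
    return {
        "under_upload_cap": size_bytes <= upload_cap_bytes,
        "est_tokens_low": row_count * low,
        "est_tokens_high": row_count * high,
        "fits_context_best_case": row_count * low <= context_tokens,
        "fits_context_worst_case": row_count * high <= context_tokens,
    }
```

For example, a 5MB export with 30,000 rows passes the upload cap but is estimated at 1.5 to 4.5 million tokens, far beyond a 128,000-token window. That is the situation in which the aggregation or split step becomes necessary.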


8. If too large: aggregated, profiled, or split before uploading

The right pre-size path depends on the question type. For trend and distribution questions ("how did revenue change by quarter?"), aggregate the data before uploading — a 2M-row file becomes 24 monthly rows and the AI reasons over the full dataset in one prompt. See Summarize a Huge CSV Before Sending It to AI. For record-lookup and anomaly questions that require raw rows, split into chunks under the upload cap. See How to Split a Large CSV for ChatGPT Without Uploading It. For structure questions ("what columns and types are in this file?"), run a profiling pass and send the statistics output instead of the raw data.
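
Two of these paths, aggregation and splitting, can be sketched in a few lines. The sketch assumes dates are already ISO 8601 strings (item 5), so the first seven characters give the "YYYY-MM" month key. Column names and both function names are hypothetical.

```python
from collections import defaultdict

def monthly_totals(rows: list[dict], date_col: str, value_col: str) -> dict[str, float]:
    """Sum a numeric column per month, assuming ISO 8601 dates (YYYY-MM-DD)."""
    totals: dict[str, float] = defaultdict(float)
    for row in rows:
        month = row[date_col][:7]  # "2026-01-05" -> "2026-01"
        totals[month] += float(row[value_col])
    return dict(sorted(totals.items()))

def split_rows(rows: list, chunk_size: int) -> list[list]:
    """Split rows into consecutive chunks of at most chunk_size rows each."""
    return [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]
```

The aggregated output is what gets uploaded for trend questions. For lookup questions, each chunk from split_rows is written back out with the header row repeated so that every file stands on its own.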


9. Tested on a 10-row sample first

Converting, masking, formatting, or aggregating a 10-row subset and pasting it into the AI tool before processing the full dataset is the cheapest validation step available. It catches structural parsing errors — the AI misread a column name, the date format is still ambiguous after standardization, the JSONL shape does not match the API schema — before those errors affect the full analysis or abort a batch job.

The 10-row test takes under five minutes and exposes categories of error that post-hoc correction takes hours to undo. If the sample returns correct answers, scale up. If not, the cost of fixing the sample is trivial compared to re-running a large batch.
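
Cutting the sample is a short script. This sketch uses the csv module rather than splitting on newlines, so quoted fields that contain line breaks stay intact. head_sample is a hypothetical helper, and taking the first rows is an assumption of convenience; the first 10 rows are not always representative of the whole file.

```python
import csv
import io

def head_sample(csv_text: str, n: int = 10) -> str:
    """Return the header plus the first n data rows as CSV text."""
    reader = csv.reader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for i, row in enumerate(reader):
        if i > n:  # row 0 is the header, so rows 1..n are the data rows kept
            break
        writer.writerow(row)
    return out.getvalue()
```

Run the sample through the same masking, formatting, or conversion steps planned for the full file, then paste the result into the AI tool and check the answer by hand.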


10. Prep history is documented

Noting which tools and options you applied — which columns were masked and with which technique, which output format, which aggregation columns and functions, which row count per chunk — creates the audit trail for debugging unexpected downstream behavior. If a fine-tuning job produces surprising model output, or a ChatGPT analysis contains a suspicious result, the prep record lets you verify whether the issue is in the source data, the prep step, or the AI's response.

A two-minute plain-text note ("re-encoded to UTF-8, hashed customer email, split to 1,000-row chunks, 2026-05-28") costs almost nothing and pays back quickly when an analysis needs to be re-run or audited. Data Masking produces a compliance summary listing which columns were processed and which technique was applied — save that output alongside the converted file.
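
If a structured record is preferred over a free-text note, the same information fits in a small JSON file. The field names below are illustrative assumptions, not a format that any tool requires, and prep_record is a hypothetical helper.

```python
import json
from datetime import date
from typing import Optional

def prep_record(source_file: str, steps: list[str], prepared_on: Optional[date] = None) -> str:
    """Build a JSON prep log: which file, which steps, and when."""
    record = {
        "source_file": source_file,
        "prepared_on": (prepared_on or date.today()).isoformat(),
        "steps": steps,  # in the order they were applied
    }
    return json.dumps(record, indent=2)
```

Save the output next to the converted file, for example as a .json file with the same base name, so the record travels with the data it describes.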


The Strong Privacy Item: Item 4 in Full

Item 4, masking PII before upload, is the highest-stakes item on the list because the failure mode is exposure, not a wrong answer. Consumer AI plans include a training-consent default in their terms: ChatGPT Free and Plus on chatgpt.com, Claude Free, Pro, and Max on claude.ai, and Gemini consumer plans all use uploaded data to improve their models unless you actively opt out in account settings. For files that contain customer names, email addresses, account numbers, Social Security numbers, or medical record identifiers, failing to mask before upload means those records potentially enter training datasets for foundational AI models. ChatGPT vs Claude vs Gemini: File Upload Limits Compared covers the platform-by-platform retention and training defaults in full.

Commercial API tiers operate under different terms. The OpenAI API, Anthropic API, and Google Cloud AI API are explicitly exempt from training-by-default under their commercial Terms of Service — data submitted via API is used to fulfill the API call and is not used to improve base models without a separate opt-in agreement. If your workflow calls an AI API directly — for fine-tuning, batch inference, or RAG ingestion — the training-by-default risk does not apply. The risk is specific to consumer-plan uploads via the chat interface.

The practical discipline is to mask regardless of tier, as a least-privilege default: only the columns the AI needs for the task should be present, and any column that carries more identifying information than the question requires should be masked or dropped before conversion. For the full masking workflow — 8 techniques, 50+ PII pattern library, local browser processing at 10GB scale — see How to Remove PII From a CSV Before Using AI. For the broader privacy framework around CSV workflows, see Privacy-First Data Processing Guide.


When You Can Skip Steps

The checklist is calibrated for the highest-risk scenario: a large customer export with PII going to a consumer AI plan for open-ended analysis. For lower-stakes use cases, some items do not apply — but skip them because you have verified they do not apply, not because you are guessing.

For a 100-row internal demo file with no customer data, no PII, and a question about a dataset that fits well within the context window: items 3 (nulls), 4 (PII), 7 (size), and 8 (too large) are likely irrelevant. Items 1, 2, 5, 6, 9, and 10 still apply — that is still the majority of the checklist.

For a fine-tuning dataset being submitted via the OpenAI API from a validated internal system: item 4 (PII) still applies even though the API is exempt from training-by-default on submission. The training examples themselves become part of the fine-tuned model's behavior and influence its outputs — they warrant the same scrutiny as any data entering a model, regardless of the submission tier.


Note: For CSV preparation before CRM or database import, see CSV Import Checklist. For a privacy-focused checklist covering GDPR, HIPAA, and data-sharing obligations, see Data Privacy CSV Checklist. This checklist is specific to AI-pipeline ingestion: ChatGPT, Claude, Gemini, fine-tuning endpoints, batch APIs, and RAG pipelines.


Additional Resources

How this guide was built: Checklist validated against OpenAI File Uploads FAQ, OpenAI tokenizer documentation, Anthropic API Terms of Service, and GDPR Article 4 PII definition, May 2026. Consumer training defaults from OpenAI and Anthropic published help documentation. Token estimates from the established ~50–150 tokens/row heuristic documented across A1 and A3.


FAQ

No — calibrate the checklist to the data sensitivity and use case. For a small internal demo file with no PII going to your own API account, items 3, 4, 7, and 8 likely do not apply. For a customer export going to a consumer AI plan, all 10 apply. The rule is to skip items you have verified are irrelevant, not items you have not checked. Item 4 (PII) and item 7 (size) are the two most commonly skipped without verification — both are high-cost misses.

Open the file in VS Code and look at the bottom status bar — the encoding is shown for the active file without any additional steps. If you see "UTF-8," you are ready. If you see "Windows 1252," "Latin-1," or "ANSI," re-save as CSV UTF-8 from Excel (File → Save As → CSV UTF-8 format in the file type dropdown). In Notepad++, check Encoding → show current encoding. The Format & Encoding Checker produces a confirmed report without requiring a manual text editor inspection.

For consumer AI plan uploads, apply GDPR Article 4's definition: any data that identifies or can identify a natural person. In practice: name, email, phone number, postal address, IP address, device identifier, account number, national ID number, and date of birth. For HIPAA contexts, the 18-identifier Safe Harbor list is the operative reference (see How to Remove PII From a CSV Before Using AI for the full list). For internal data with no direct identifiers, assess whether combinations of columns (age + zip code + employer) can together re-identify individuals.

The OpenAI API and Anthropic API are exempt from training-by-default under their commercial terms — data submitted via API is not used to improve base models without a separate opt-in. The masking question for API workflows is about least-privilege, not training: only send the columns the AI needs for the task. If the question is about monthly revenue by region, there is no reason to include customer names in the submitted file. Masking eliminates unnecessary exposure regardless of whether the receiving system trains on the data.

For ChatGPT Data Analysis, XLSX is accepted directly (subject to the ~50MB cap). For Claude's web interface, XLSX requires code execution and file creation to be enabled in account settings — CSV is simpler and works without that dependency. For fine-tuning or batch APIs (OpenAI, Anthropic), Excel is not a valid input format; convert to JSONL first. See Convert Excel to JSON for AI APIs and LLM Pipelines for the shape-by-use-case decision table and the conversion workflow.

Two limits apply independently. The upload cap is approximately 50MB for ChatGPT spreadsheets — this is the byte-level test before the model sees the data. The context-window limit is 128,000 tokens for GPT-4o — this is the content test during processing, equivalent to roughly 850–2,500 rows of typical structured data depending on column width and value length. A file can pass the upload cap and still overflow the context window; always estimate both before uploading. See What's a Token? for the two-limit framework and the token estimate table.

Open the file in VS Code. The encoding label in the bottom status bar updates instantly when you open a file — no configuration needed. If the file shows "UTF-8," it is ready. If it shows anything else, re-save as UTF-8 from Excel before uploading. For a batch of files or files you cannot open locally, the Format & Encoding Checker processes the file in your browser and reports the encoding without uploading it to a server.


Check Encoding and Format Before It Becomes a Problem

Verify UTF-8 encoding before any upload — non-UTF-8 files corrupt characters that AI tokenizers cannot recover from
Run a profiling pass on unknown datasets — null counts, data types, and date distribution reveal what needs fixing before you prompt
Mask PII before any upload to a consumer AI plan — consumer plans train on uploads by default unless you actively opt out
Document your prep steps — a two-minute note on what you masked and how becomes the audit trail when results need re-running
