Extract and Clean Product Codes from CSV Data

March 15, 2026
By SplitForge Team

Fast Fix (60 Seconds)

If you have product codes buried in a CSV column right now:

  1. Open SplitForge Pattern Extraction — no account required
  2. Upload your product CSV
  3. Select the column containing the buried codes (Description, Title, Notes)
  4. Choose ID/Code mode and enter your code prefix (e.g., "SKU-" or "ASIN:")
  5. Download — extracted codes appear in a new column

Nothing is sent to a server. No Python. Everything runs in your browser.


TL;DR: Product catalog CSVs frequently bury SKUs, ASINs, GTINs, and internal codes inside description fields, product titles, or notes columns. Manually isolating these codes across thousands of rows causes errors and delays catalog launches. SplitForge Pattern Extraction identifies and extracts any product code format — with or without a known prefix — in a single browser-based pass.


Your supplier sent a 12,000-row product catalog. The SKU is embedded in the product title: "Blue Widget 500ml [SKU-8847] Case of 12." Your inventory system needs a clean SKU column to import. Your options are to write a formula for every row, pay someone to clean it manually, or spend two hours on a Python script.

There's a faster way.

Each extraction scenario was tested in March 2026 using SplitForge Pattern Extraction against real-world product catalog exports from e-commerce and inventory systems ranging from 5,000 to 400,000 rows. In one wholesale catalog we analyzed, SKUs appeared in five different column locations across the file (title, description, notes, a dedicated SKU field, and a legacy reference column), all requiring extraction and consolidation before a clean import was possible.


Product code extraction has one failure mode that breaks basic tools: codes without consistent prefixes. An ASIN is 10 characters and typically starts with B0, which makes it a reliable pattern. A supplier's internal SKU might be 6 alphanumeric characters with no prefix at all. The strategy for each is different, and most guides only cover the easy prefix case. This guide covers all extraction strategies. Throughout, files are streamed and processed in dedicated Web Worker threads to avoid blocking the UI thread, and your catalog data never leaves your browser.


Product Code Formats and How to Extract Each

Product codes follow predictable formats once you know the standard. The challenge is that different marketplaces and systems use different conventions — and a single catalog often mixes several.

| Code Type | Format | Example | Extraction Strategy |
|---|---|---|---|
| ASIN (Amazon) | 10 chars, starts with B0 | B0CXYZ1234 | Pattern: B0 + 8 alphanumeric |
| SKU (internal) | Varies, often prefix + digits | SKU-88471 | Anchor to prefix |
| UPC-A (US barcode) | 12 digits exactly | 012345678905 | 12-digit numeric string |
| EAN-13 (EU barcode) | 13 digits exactly | 0123456789012 | 13-digit numeric string |
| GTIN-14 (logistics) | 14 digits exactly | 01234567890128 | 14-digit numeric string |
| ISBN-13 | 13 digits, starts with 978 or 979 | 9781234567897 | Anchor to 978/979 prefix |
| Custom internal code | Alphanumeric, often 6-10 chars | A1B2C3 | Define length and character class |

This is the reference table. Before running any extraction, identify your code type and select the matching strategy. Getting this right before configuring the tool saves multiple re-runs.

Which Extraction Strategy Should You Use?

Before opening the tool, answer these three questions. The answers determine your configuration:

Do you know the code prefix? (e.g., "SKU-", "ASIN:", "INV-")
│
├── YES → Use Prefix + Length mode
│         Enter the prefix. Set expected character count after prefix.
│         False positive risk: very low.
│
└── NO → Is the code a fixed numeric length? (UPC, EAN, GTIN, ISBN)
          │
          ├── YES → Use Exact Numeric Length mode
          │         Set exact digit count (12, 13, or 14).
          │         Enable check-digit validation if available.
          │         False positive risk: low with strict exact length.
          │
          └── NO → Do you know the character type and approximate length?
                    │
                    ├── YES → Use Character Class + Length Range mode
                    │         Define alphanumeric vs digits-only.
                    │         Set min/max length.
                    │         False positive risk: medium — always preview first.
                    │
                    └── NO → Sample 100–200 rows manually first.
                              Identify the most consistent structural feature.
                              Return to step 1 with that anchor.

The decision point most people skip is the last one. If you can't describe the code format in one sentence, extraction will produce garbage. A 10-minute manual review of 100 rows is faster than re-running extraction 5 times trying to tune a pattern against nothing specific.
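
For that last branch, one way to speed up the manual review is to tally token "shapes" across the sample, so the most consistent structure stands out. The sketch below is only an illustration: `shape` and `common_shapes` are hypothetical helper names, and the rule that a candidate code contains at least one digit is an assumption, not a SplitForge feature.

```python
import re
from collections import Counter

# Tokens made of letters/digits, optionally joined by -, _ or /.
TOKEN_RE = re.compile(r"[A-Za-z0-9][A-Za-z0-9_/-]*[A-Za-z0-9]")

def shape(token: str) -> str:
    """Map digits to '9' and letters to 'A', keeping separators as-is."""
    return re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", token))

def common_shapes(rows, top=5):
    """Count token shapes across sample rows and return the most frequent."""
    counts = Counter()
    for row in rows:
        for token in TOKEN_RE.findall(row):
            # Assumption: a product code contains at least one digit.
            if any(c.isdigit() for c in token):
                counts[shape(token)] += 1
    return counts.most_common(top)

sample = [
    "Blue Widget 500ml [SKU-8847] Case of 12",
    "Red Widget 250ml [SKU-9001] Case of 6",
]
print(common_shapes(sample))
```

A shape such as `AAA-9999` appearing once per row is a strong hint that a prefix anchor exists; a shape that appears in only a fraction of rows suggests mixed formats that need separate passes.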



This guide is for: E-commerce managers, catalog teams, Amazon sellers, and inventory managers who need to isolate product identifiers from mixed-content CSV columns.


How to Extract SKUs with a Known Prefix

SKUs with consistent prefixes are the most straightforward extraction case. If your SKUs always start with "SKU-", "ITEM-", or another fixed string, the prefix anchors the pattern reliably and reduces false positives to near zero.

Step 1: Identify the prefix and code structure

Examine 10–20 rows in your source column and answer these questions:

  • Does the prefix appear consistently? (SKU-XXXX vs some rows with SKU_XXXX — note the separator difference)
  • What characters follow the prefix — digits only, alphanumeric, or mixed?
  • Is the code length fixed or variable?

Step 2: Configure extraction

  1. Open SplitForge Pattern Extraction
  2. Upload your CSV and select the source column
  3. Choose ID/Code mode
  4. Enter your prefix in the Prefix field (e.g., "SKU-")
  5. Set the expected code length or range (e.g., 4–8 characters after the prefix)
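
For readers who want to see what a prefix-plus-length rule amounts to, here is a minimal regular-expression sketch. It is not SplitForge's implementation; `extract_after_prefix` is a hypothetical helper, and the alphanumeric character class is an assumption about the code body.

```python
import re

def extract_after_prefix(text: str, prefix: str, min_len: int, max_len: int) -> str:
    """Return the code that follows `prefix` (prefix dropped), or '' if absent."""
    pattern = re.compile(
        re.escape(prefix)
        + r"([A-Za-z0-9]{%d,%d})(?![A-Za-z0-9])" % (min_len, max_len)
    )
    match = pattern.search(text)
    return match.group(1) if match else ""

title = "Blue Widget 500ml [SKU-8847] Case of 12"
print(extract_after_prefix(title, "SKU-", 4, 8))
```

The negative lookahead rejects codes longer than the maximum instead of silently truncating them, which keeps over-long strings out of the extracted column.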

Step 3: Handle prefix variations

If your data has inconsistent separators (SKU-8847 and SKU_8847 both appear), run two extraction passes — one for each variant — then use SplitForge Find & Replace to standardize the separator in the combined output.
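
Outside the tool, the same result can be sketched in one pass by accepting either separator and rewriting it on output. This is an alternative to the two-pass approach, not a description of it; the 4 to 8 digit body and the `normalize_sku` name are assumptions for illustration.

```python
import re

# Accept "SKU-" or "SKU_" followed by 4 to 8 digits (assumed code body).
SKU_ANY_SEP = re.compile(r"\bSKU[-_](\d{4,8})\b")

def normalize_sku(text: str) -> str:
    """Return the SKU with a hyphen separator, or '' when none is found."""
    match = SKU_ANY_SEP.search(text)
    return f"SKU-{match.group(1)}" if match else ""

print(normalize_sku("Widget SKU_8847 blue"))
print(normalize_sku("Widget [SKU-8847]"))
```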

What success looks like:

  • Only the code portion appears in the extracted column (not the prefix + code)
  • Rows without the prefix return blank cells
  • No surrounding text appears in the extracted value
  • If surrounding text appears, reduce the maximum match length in settings

How to Extract ASINs from Product Titles or Descriptions

Amazon Standard Identification Numbers follow a fixed 10-character format beginning with "B0" for physical products. This makes ASINs reliably extractable even without a surrounding text prefix — the format itself is the anchor.

GS1, the global supply chain standards body, maintains GTIN identifiers that are separate from and distinct from ASINs. Don't conflate them when setting up extraction for multi-marketplace catalogs.

Step 1: Use the ASIN format pattern

In ID/Code mode, enter "B0" as the prefix and set total length to exactly 10 characters (2-character prefix + 8 alphanumeric characters). Enable alphanumeric matching (letters and digits, case-insensitive).

Step 2: Verify the preview

ASINs appear in many contexts in product data — product titles, descriptions, bundle notes, and cross-reference fields. The preview will show every match. Check that all extracted values are exactly 10 characters and start with B0. Any other pattern is a false positive.
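
As a rough equivalent of the B0 pattern and the preview check, the following sketch finds every candidate in a cell and confirms each is 10 characters starting with B0. The function names are hypothetical, and uppercasing the text first is one assumed way to get case-insensitive matching.

```python
import re

ASIN_RE = re.compile(r"\bB0[A-Z0-9]{8}\b")

def extract_asins(text: str) -> list:
    """Return all B0-format ASIN candidates in the text, uppercased."""
    return ASIN_RE.findall(text.upper())

def looks_like_asin(value: str) -> bool:
    """Preview-style check: exactly 10 characters and starts with B0."""
    return len(value) == 10 and value.startswith("B0")

cell = "Premium Headphones (ASIN: B0CXY12345) compatible with iOS"
found = extract_asins(cell)
print(found, all(looks_like_asin(v) for v in found))
```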

Step 3: Handle older B-format ASINs

Older ASINs sometimes consist of "B" followed by 9 digits (e.g., B000123456, which is still 10 characters). Where the second character is 0, as in that example, the primary B0 pattern already matches. If your catalog includes older products whose ASINs start with "B" but not "B0", run a second extraction pass targeting the "B" prefix with digit-only characters and merge with the primary output.
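
The two-pass-and-merge idea can be sketched as follows. The patterns and the `extract_all_asins` name are illustrative assumptions; the merge keeps first-seen order and drops values that both passes return.

```python
import re

PRIMARY = re.compile(r"\bB0[A-Z0-9]{8}\b")  # B0 + 8 alphanumeric
LEGACY = re.compile(r"\bB\d{9}\b")          # B + 9 digits

def extract_all_asins(text: str) -> list:
    """Run both passes and merge, preserving order and removing repeats."""
    found = PRIMARY.findall(text) + LEGACY.findall(text)
    return list(dict.fromkeys(found))

print(extract_all_asins("Old B000123456 and new B0CXY12345"))
```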


How to Extract Barcodes (UPC, EAN, GTIN)

Barcodes are fixed-length numeric strings: UPC-A is 12 digits, EAN-13 is 13 digits, GTIN-14 is 14 digits. The extraction strategy is strict digit-count matching — no prefix needed. The challenge is false positives from other numeric strings (phone numbers, order IDs, date-time values) that happen to contain similar digit sequences.

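The fix is strict exact-length enforcement: match a run of exactly the target number of digits, with no digit immediately before or after it. A US phone number is 10–11 digits. A UPC-A is exactly 12. The one-digit difference is enough to separate them reliably.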

Step 1: Choose the barcode standard

Determine which standard your catalog uses. US retail uses UPC-A (12 digits). European and international products use EAN-13. Distribution and logistics often use GTIN-14. If your catalog mixes standards, extract each in a separate pass.

Step 2: Configure strict numeric matching

In ID/Code mode, set:

  • Character class: Digits only
  • Exact length: 12, 13, or 14 (strict exact match — not a range)
  • No prefix

The strict exact length is what separates barcodes from similar-length numeric codes. Never use a length range for barcode extraction.
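
To make "strict exact length" concrete, here is a small sketch using lookarounds so that a longer digit run never yields a shorter partial match. `extract_exact_digits` is a hypothetical helper, not a SplitForge setting.

```python
import re

def extract_exact_digits(text: str, length: int) -> list:
    """Return digit runs of exactly `length` digits, not part of a longer run."""
    pattern = re.compile(r"(?<!\d)\d{%d}(?!\d)" % length)
    return pattern.findall(text)

cell = "Call 15551234567 about item 012345678905"
print(extract_exact_digits(cell, 12))
```

With a range such as 12 to 14, the 11-digit phone number would still be excluded, but a 13-digit EAN and a 12-digit UPC in the same column could no longer be told apart, which is why each standard gets its own pass.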

Step 3: Validate with check digit awareness

UPC-A and EAN-13 both include a check digit: the final digit is calculated from the preceding digits using the GS1 check digit algorithm, which applies a weighted modulo-10 calculation to the preceding 11 or 12 digits. SplitForge Pattern Extraction flags potential check digit failures in the preview when barcode mode is enabled. Any value that fails check digit validation is not a valid barcode. It is either a mistyped code or a numeric string that happens to be the right length.
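
The GS1 calculation is short enough to sketch directly. Starting from the digit next to the check digit and moving left, weights alternate 3, 1, 3, and so on; the check digit is whatever brings the weighted sum up to a multiple of 10. The function names below are hypothetical, and this is a standalone illustration, not SplitForge's code.

```python
def gs1_check_digit(body: str) -> int:
    """Compute the GS1 check digit for the digits that precede it."""
    total = 0
    # Weights alternate 3, 1, 3, ... starting from the rightmost body digit.
    for i, ch in enumerate(reversed(body)):
        total += int(ch) * (3 if i % 2 == 0 else 1)
    return (10 - total % 10) % 10

def is_valid_gtin(code: str) -> bool:
    """True for a 12-, 13- or 14-digit string whose last digit is correct."""
    if not (code.isascii() and code.isdigit()) or len(code) not in (12, 13, 14):
        return False
    return gs1_check_digit(code[:-1]) == int(code[-1])

print(is_valid_gtin("0123456789012"))
```

Because the weights are anchored at the right-hand end, the same function covers UPC-A, EAN-13, and GTIN-14 without per-standard branches.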


How to Extract Custom Internal Codes

Internal codes without consistent prefixes or fixed lengths are the hardest extraction case. The right approach depends on what you know about the code format.

If you know the exact length: Set exact length matching with the appropriate character class. A 6-character alphanumeric code can be reliably extracted if no other 6-character alphanumeric strings appear in the same column.

If you know a partial structural pattern: Even a partial anchor helps. "Codes always contain a hyphen after the third character" (e.g., A1B-234) narrows the match significantly. Use the structural pattern option to define this.

If the code has no distinguishing features: A manual review of the first 100 rows to identify consistent structure is faster than trying to tune extraction against nothing specific. Even "letters in positions 1-3, then digits" is a usable constraint.
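
Each of the three situations above maps to a different kind of pattern. The sketch below shows one assumed pattern per case; the exact lengths (three characters after the hyphen, a digit required in the six-character code) are illustrative guesses that would need adjusting to your own data.

```python
import re

# Assumed: three characters, a hyphen, then three more (as in A1B-234).
HYPHEN_AFTER_THIRD = re.compile(r"\b[A-Z0-9]{3}-[A-Z0-9]{3}\b")

# Assumed: exactly six uppercase alphanumerics with at least one digit,
# which keeps plain six-letter words out of the results.
SIX_CHAR_WITH_DIGIT = re.compile(r"\b(?=[A-Z]*\d)[A-Z0-9]{6}\b")

# Assumed: letters in positions 1-3, then one or more digits.
LETTERS_THEN_DIGITS = re.compile(r"\b[A-Z]{3}\d+\b")

def find_codes(pattern, text: str) -> list:
    """Return every match of the given pattern in the text."""
    return pattern.findall(text)

print(find_codes(HYPHEN_AFTER_THIRD, "Gasket kit A1B-234 for pump"))
print(find_codes(SIX_CHAR_WITH_DIGIT, "BLUE WIDGET A1B2C3 LARGE"))
print(find_codes(LETTERS_THEN_DIGITS, "Filter ABC12345 spare"))
```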


Cleaning Extracted Codes

Extraction outputs often need minor cleanup before import. Three issues appear consistently:

Leading and trailing spaces: Cells where the code is surrounded by spaces in the source (" SKU-8847 "). Use SplitForge Data Cleaner to trim whitespace from the extracted column.

Inconsistent case: "SKU-8847" and "sku-8847" in the same column. Standardize to uppercase using the column transform option before import to avoid duplicate records caused by case differences.

Duplicate codes: If the same product appears in multiple rows with different descriptions, extraction produces duplicate code entries. Use SplitForge Remove Duplicates to deduplicate the extracted code column before importing to your catalog system.
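
The three cleanup steps are simple enough to sketch in a few lines, which can help when checking what the cleaned column should look like. `clean_codes` is a hypothetical helper; it trims, uppercases, and drops blanks and repeats while keeping first-seen order.

```python
def clean_codes(codes):
    """Trim whitespace, uppercase, and remove blanks and duplicates."""
    seen = set()
    cleaned = []
    for raw in codes:
        code = raw.strip().upper()
        if code and code not in seen:
            seen.add(code)
            cleaned.append(code)
    return cleaned

print(clean_codes([" SKU-8847 ", "sku-8847", "SKU-9001", ""]))
```

Uppercasing before deduplicating matters: done in the other order, "SKU-8847" and "sku-8847" would both survive.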


Common Catalog Scenarios

Scenario 1: Supplier feed with SKU in the product title

Source column value: "Blue Widget 500ml [SKU-8847] Case of 12"
Configuration: Bracket-enclosed SKU, prefix: "[SKU-", suffix anchor: "]"
Extracted result: SKU-8847

Scenario 2: Amazon export with ASIN in the description

Source column value: "Premium Headphones (ASIN: B0CXY12345) — Compatible with iOS and Android"
Configuration: ASIN prefix with colon, prefix: "ASIN: ", code length: 10 characters after the prefix
Extracted result: B0CXY12345

Scenario 3: Wholesale catalog with EAN-13 at the start of each row

Source column value: "0123456789012 — Blue Widget 500ml — Case of 12"
Configuration: 13-digit numeric string, position: start of cell, exact length: 13
Extracted result: 0123456789012

Scenario 4: Internal catalog with no consistent format

Source: Mixed descriptions with alphanumeric codes scattered throughout, no visible pattern
Approach: Export a 200-row sample first. Identify the most consistent structural feature: length, character set, or position. Define that as your extraction pattern and test on the sample before running on the full file.
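
For comparison, the first three scenarios can be sketched as regular expressions applied to a CSV column, with the result written to a new column. This is a minimal illustration under stated assumptions: the pattern names, `add_code_column`, and the `extracted_code` column name are all hypothetical, and the input is assumed to have a header row.

```python
import csv
import io
import re

SCENARIOS = {
    "bracketed_sku": re.compile(r"\[(SKU-\d+)\]"),
    "labelled_asin": re.compile(r"ASIN:\s*(B0[A-Z0-9]{8})\b"),
    "leading_ean13": re.compile(r"^\s*(\d{13})(?!\d)"),
}

def add_code_column(csv_text, source_col, pattern, new_col="extracted_code"):
    """Return CSV text with a new column holding the first match per row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    fields = list(reader.fieldnames or []) + [new_col]
    writer = csv.DictWriter(out, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        match = pattern.search(row[source_col] or "")
        # Rows without a match get a blank cell, not an error.
        row[new_col] = match.group(1) if match else ""
        writer.writerow(row)
    return out.getvalue()

catalog = 'title\n"Blue Widget 500ml [SKU-8847] Case of 12"\nPlain item\n'
print(add_code_column(catalog, "title", SCENARIOS["bracketed_sku"]))
```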

For more on working with product catalogs, see our guide on cleaning product catalog CSVs for Shopify and WooCommerce.


Additional Resources

Browser Processing:

  • MDN Web Workers API — How browser-based background processing handles large files without blocking


FAQ

Can I extract product codes that have no prefix?
Yes, using length-based and character-class-based matching. Define what you know about the code structure (number of characters, letters vs digits, any partial pattern) and configure the ID/Code mode accordingly. The preview lets you refine the pattern before running on the full file.

What happens when a cell contains more than one code?
The tool returns the first match by default. Enable All Matches mode to return a comma-separated list of all matches found in the cell. This is useful when one product has multiple associated codes such as a UPC and a GTIN.

How accurate is ASIN extraction?
Highly accurate for the B0XXXXXXXX format: the 10-character fixed length and B0 prefix make false positives rare. For older B-format ASINs (B followed by 9 digits), accuracy depends on whether other 10-character strings starting with "B" appear in your data.

Can it handle large catalogs?
Yes. The tool has been tested on product catalogs up to 5 million rows. Processing time for 500,000 rows is typically under 10 seconds in Chrome on a modern machine.

Can I extract codes from an Excel file?
Yes. Upload your Excel file directly and SplitForge converts it automatically before extraction. The output downloads as a CSV ready for import.

Do codes containing special characters work?
Special characters within codes (SKU-8847 or PART/A1047) are supported. Include the separator in your prefix definition or character class configuration. Hyphens and forward slashes are treated as literal characters in the pattern.

Does extraction change my original file?
No. SplitForge creates a new output file. Your original catalog file is never modified, stored, or transmitted to any server.


Extract Product Codes From Any Catalog CSV

Extracts SKUs, ASINs, UPCs, EANs, GTINs, and custom internal codes
Works on codes with prefixes, fixed lengths, or structural patterns
Files process entirely in your browser — your product catalog never leaves your machine
No row limit — handles supplier feeds of any size
