Why CSV is harder than it looks
CSV is the format people reach for when they want something simple. Then they hit the first comma inside a value, and the simple format starts revealing how little it actually agrees on.
The truth is there is no single CSV. There are dozens of slightly different conventions, and any tool that reads a file produced by another tool is making assumptions that may or may not hold. The traps below are the ones that catch real analysts in real workflows, every single week.
Common CSV bugs at a glance
| Bug | Symptom | Fix |
|---|---|---|
| Quote escaping | Stray quotes split rows | Pick one style (doubled or backslash) |
| Encoding mismatch | Garbled accents, replacement chars | Save and read as UTF-8 with BOM |
| Numeric coercion | Leading zeros vanish, IDs become floats | Read columns as strings |
| Trailing newline | Phantom empty last row | Strip blank lines on parse |
| Header inference | Wrong column types, off-by-one | Pass an explicit schema |
| Date timezone | Dates shift by a day | Store ISO 8601 with offset |
The delimiter is not always a comma
In most of Europe, the comma is the decimal separator. Excel in Spanish, French, German, Italian or Portuguese exports CSV using a semicolon by default, because using a comma would conflict with 1,5 meaning “one point five”.
The Tab Separated Values format is the same idea with tabs. Some pipelines use pipes (|) for the same reason. None of them are wrong. They are just not the same file.
When you receive a CSV and the parser puts everything in a single column, the first thing to check is the delimiter. Open the file in a plain text editor and look at the first few lines. Whatever sits between the values is your delimiter, no matter what the extension says.
Quoted fields and embedded commas
The standard rule is: if a value contains the delimiter, wrap the value in quotes. Hello, world becomes "Hello, world" in a comma separated file.
The trap is that the rule has variations. Some systems double the inner quote ("He said ""hi"""). Others escape it with a backslash ("He said \"hi\""). Excel uses the doubled version, most Unix tools accept both. A file that looks fine in one tool can become garbage in another.
When something does not parse, paste the file into the CSV to JSON converter on AldeaCode. It tries the common quoting variants and shows you which fields contain hidden commas, quotes, or newlines. The data never leaves your tab.
Embedded newlines: the silent destroyer
A multi line value (a customer note, a paragraph of feedback) can contain a real newline character inside it. The standard says you wrap that value in quotes and the newline becomes part of the field.
Most tools handle this. Some do not. Some count lines naively and end up reporting “1,000 rows” when the file actually has 800, with 200 newlines hidden inside cells. If your row count looks slightly off after every export, embedded newlines are the first suspect.
The fix is to strip newlines from the source data before exporting, or to verify that your parser handles quoted multi line fields. The text counter on AldeaCode is useful when you need a sanity check on row counts after the parse.
Encoding mismatches: the BOM and the accent
The single most common CSV bug is opening a file in the wrong encoding. UTF-8 has been the default everywhere for fifteen years, but Excel still saves CSV as Windows-1252 or similar by default on some systems.
If you open an UTF-8 file as Windows-1252, every accented letter becomes mojibake (pingüino becomes pingüino). If you do the reverse, the file fails to parse altogether.
The Byte Order Mark (BOM) is three invisible bytes that some Microsoft tools prepend to UTF-8 files. They tell the reader “this is UTF-8”. Most modern parsers handle the BOM correctly. Some parse it as part of the first column header, leaving you with a column called Name instead of Name. If your first column refuses to match by name, suspect a BOM.
Locale aware decimals and dates
1,234.56 in the United States is the same number as 1.234,56 in Spain. The same string parses as a different number depending on which locale your tool assumes.
Same with dates. 01/02/2026 is January 2nd in the US and February 1st almost everywhere else. There is no way to detect from the string which one was meant.
The robust fix is to use ISO formats inside the file: numbers without thousand separators, dates as 2026-02-01. The pragmatic fix is to ask the tool that produced the file what locale it used, and parse with that locale on purpose. Guessing is how you end up with January and February swapped in three out of five rows.
A 10 second pre flight check
Before you trust a CSV, do this:
- Open it in a plain text editor. Confirm the delimiter and the encoding.
- Count the rows by counting the line breaks, then count again with your parser. If the numbers differ, embedded newlines are hiding inside cells.
- Look at the first column header. If it has weird characters at the start, you have a BOM.
- Spot check a few rows for accented characters and decimal numbers. If anything looks wrong, the encoding or locale is wrong.
When you need to inspect a CSV without the rest of the pipeline, the CSV to JSON converter, the JSON formatter and the text counter on AldeaCode all run in your browser. The data never leaves your tab, no upload, no log.
CSV is one of those formats that works perfectly until it does not. The traps are not in the spec, they are in the gap between what your tool assumes and what the file actually contains. Ten seconds of inspection saves you an afternoon of debugging.