A comma-separated values (CSV) file is a delimited text file that uses a comma to separate values. Each line of the file is a data record.
Its simplicity and ubiquity make CSV an extremely popular way for organizations to exchange data both internally and externally. While many programs can’t read or write Excel spreadsheets, almost anything can read and write CSVs, and a human can open a CSV file in any text editor and understand roughly what it contains. Despite this ubiquity and ease of access, CSV is a wretched way to exchange data. The CSV format itself is notoriously inconsistent, with myriad competing and mutually-exclusive formats that often coexist within a single dataset (or, if you’re particularly unlucky, a single file). Exporting a dataset as a CSV robs it of a wealth of metadata that is very hard for the reader to reconstruct accurately, and many programs’ naïve CSV parsers ignore the metadata reconstruction problem entirely as a result. In practice, CSV’s human-readability is more of a liability than an asset.
CSVs often begin life as exported spreadsheets or table dumps from legacy databases, and often end life as a pile of undifferentiated files in a data lake, awaiting the restoration of their precious metadata so they can be organized and mined for insights. Much of the job of data preparation products is restoring the metadata lost when a spreadsheet is exported as a CSV. The world has lost a mind-boggling amount of time and information to CSV’s imprecision and under-specification.
I’ve spent many years battling CSVs in various capacities. As an engineer at Trifacta, a leading data preparation tool vendor, I saw numerous customers struggling under the burden of myriad CSVs that were the product of even more numerous Excel spreadsheet exports and legacy database dumps. As an engineering leader in biotech, my team struggled to ingest CSVs from research institutions and hospital networks who gave us sample data as CSVs more often than not. As a consultant, I’ve written more than one internal system that attempts to reconstruct the metadata of a CSV of unknown provenance using a combination of heuristics and brute force. In short, CSV is a scourge that has followed me throughout my career.
In recent years, it’s become much easier for organizations to exchange data using metadata-rich, clearly-defined, and well-structured file formats. Some organizations even directly share their live datasets with one another. Despite these advances, CSV is still the path of least resistance and, as a consequence, it remains the lingua franca of intra- and inter-organizational data exchange. The time has long passed to retire CSV and replace it with something better.
The CSV format itself dates at least as far back as the early 1970s. Tired of having to manually align their input data to punch-cards’ columns, FORTRAN users quickly embraced “list-directed input/output”, a format in which (according to the FORTRAN IV manual for VM/370) “input entries are specified by blanks or commas, with successive commas indicating values to be omitted”. Most of us don’t use punch-cards anymore, but that ease of authorship remains one of CSV’s most attractive qualities. CSVs can be read and written by just about anything, even if that thing doesn’t know about the CSV format itself.
CSVs are easy to read and write because the format - if it can even be called a format - is extremely simple. At its most basic, CSVs have only two delimiting characters: a comma to delimit columns, and a newline to delimit rows. In principle, reading such a file could hardly be more simple: scan the file one character at a time, building up a value’s characters in a buffer until you encounter a comma, and adding completed values to a row until you encounter a newline.