crunch-dev mailing list archives

From Christian Tzolov
Subject RFC 4180 compliant CSV format
Date Mon, 18 Mar 2013 10:14:29 GMT

I am working on ETL projects that consume and produce data in the RFC 4180
[1] CSV format. Although unreliable IMO, this format is used for data
exchange by several Dutch government agencies.

The RFC 4180 spec supports multi-line fields (i.e. fields containing line
breaks) and escaping of double quotes and delimiters within fields. Because
of the multi-line feature, one can't directly use the
FileInputFormat/TextInputFormat or LineRecordReader implementations, since
they treat every line break as a record boundary. Furthermore, as I see it,
input splitting must be disabled, because a split boundary could fall inside
a quoted multi-line field (not sure if any efficient splitting strategy is
possible at all).
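
To make the multi-line problem concrete, here is a minimal sketch of a
reader that returns one logical RFC 4180 record at a time. The class name
Rfc4180Reader is hypothetical (it is not an existing Crunch or Hadoop API),
and the sketch assumes a comma delimiter and treats CRLF, LF or CR as the
record terminator outside quotes.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of Crunch or Hadoop. */
final class Rfc4180Reader {
    private static final int NONE = -2;   // marker: no character pushed back
    private final Reader in;
    private int pushedBack = NONE;

    Rfc4180Reader(Reader in) { this.in = in; }

    private int next() throws IOException {
        if (pushedBack != NONE) { int c = pushedBack; pushedBack = NONE; return c; }
        return in.read();
    }

    /** Returns the fields of the next record, or null at end of input. */
    List<String> readRecord() throws IOException {
        int c = next();
        if (c == -1) return null;
        List<String> fields = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        boolean quoted = false;
        while (true) {
            if (quoted) {
                if (c == -1) throw new IOException("Unterminated quoted field");
                if (c == '"') {
                    int peek = next();
                    if (peek == '"') field.append('"');          // "" is an escaped quote
                    else { quoted = false; pushedBack = peek; }  // closing quote
                } else {
                    field.append((char) c);  // delimiters and line breaks are literal here
                }
            } else if (c == '"' && field.length() == 0) {
                quoted = true;               // opening quote at the start of a field
            } else if (c == ',') {
                fields.add(field.toString());
                field.setLength(0);
            } else if (c == '\r' || c == '\n' || c == -1) {
                if (c == '\r') {             // swallow the LF of a CRLF pair
                    int peek = next();
                    if (peek != '\n') pushedBack = peek;
                }
                fields.add(field.toString());
                return fields;               // end of one logical record
            } else {
                field.append((char) c);
            }
            c = next();
        }
    }

    public static void main(String[] args) throws IOException {
        String csv = "id,note\r\n1,\"line one\nline two\"\r\n";
        Rfc4180Reader reader = new Rfc4180Reader(new StringReader(csv));
        for (List<String> rec = reader.readRecord(); rec != null; rec = reader.readRecord()) {
            System.out.println(rec.size() + " fields: " + rec);
        }
    }
}
```

Note that the reader has to track the "inside quotes" state from the start
of the stream. A reader that starts at an arbitrary byte offset cannot know
whether it is inside a quoted field, which is why splitting looks
problematic to me.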

There are several Java libraries that provide some RFC 4180 support [3]. For
Pig, a slightly modified CSVExcelStorage UDF [2] seems to do the job (not
sure about the input splitting, though). The "Hadoop in Practice" example
[4], on the other hand, does not support multi-line fields.

Has anyone used similar formats with multi-line fields? I wonder how common
this use case is.

Also, should we provide support for it in Crunch?


[1]  RFC 4180
[2]  Pig CSVExcelStorage UDF
[3]  jCSV, OpenCSV, SuperCSV
