crunch-dev mailing list archives

From Josh Wills
Subject Re: RFC 4180 compliant CSV format
Date Mon, 18 Mar 2013 16:51:46 GMT

I personally try to steer people away from multi-line input formats b/c of
how tedious they are to write/maintain. To me, the question of supporting
CSVs maps to a more general question about whether we should support some
kind of named Record/Row type for processing data from
CSV/Hive/Avro/PB/Thrift/etc. in a generic way. I could make arguments
either way, which I'm happy to do if folks are interested, but I'd rather
hear from other people first, esp. if anyone feels strongly about it.


On Mon, Mar 18, 2013 at 3:14 AM, Christian Tzolov wrote:

> Hi,
> I am working on ETL projects that consume and produce data in the RFC4180
> [1] CSV format. Although unreliable IMO, this RFC is used as an exchange
> format by several Dutch government agencies.
> The RFC4180 spec supports multi-line fields (i.e. fields containing line
> breaks) and escaping of double quotes and delimiters within fields. Because
> of the multi-line feature, one can't directly use the
> FileInputFormat/TextInputFormat or LineRecordReader implementations.
> Furthermore, as I see it, input splitting must be disabled (not sure if
> any efficient splitting strategy is possible at all).
> There are several Java libraries that provide some RFC4180 support [3]. For
> Pig, a slightly modified CSVExcelStorage UDF [2] seems to do the job (not
> sure about the input splitting though). Also, the "Hadoop in Practice"
> example [4] does not support multi-line fields.
> Has anyone used similar 'multi-line fields' formats? I wonder how common
> this use case is.
> Also, shall we provide support for it in Crunch?
> Cheers,
> Chris
> [1]  RFC 4180 -
> [2]  Pig CSVExcelStorage UDF -
> [3]  jCSV, OpenCSV, SuperCSV
> [4]

Director of Data Science
Cloudera
Twitter: @josh_wills
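
For readers of the archive, the multi-line field problem raised in the quoted message can be illustrated with a small sketch. It is not taken from the thread and uses Python's standard csv module rather than Hadoop or Crunch; the sample data and variable names are made up for illustration. It compares naive line splitting, which is roughly what a line-oriented record reader sees, with quote-aware parsing.

```python
import csv
import io

# Illustrative sample (not from the thread): a header row and two records.
# The second record's "note" field contains a line break and an escaped
# double quote (written as "" inside a quoted field, per RFC 4180).
data = (
    'id,note\r\n'
    '1,plain\r\n'
    '2,"first line\r\nsecond ""quoted"" line"\r\n'
)

# Naive approach: treat every physical line as one record.
naive_records = data.splitlines()

# Quote-aware approach: let the csv module track quoted fields.
# newline='' keeps the embedded line break inside the field untouched.
parsed_records = list(csv.reader(io.StringIO(data, newline='')))

print(len(naive_records))   # 4 physical lines
print(len(parsed_records))  # 3 logical records
print(parsed_records[2])
```

The naive split cuts the last record in two, so a reader that starts at an arbitrary line boundary cannot tell whether it is inside a quoted field without scanning from an earlier known-good position. That is the difficulty with input splitting mentioned in the quoted message.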
