crunch-dev mailing list archives

From Christian Tzolov <christian.tzo...@gmail.com>
Subject RFC 4180 compliant CSV format
Date Mon, 18 Mar 2013 10:14:29 GMT
Hi,

I am working on ETL projects that consume and produce data in the RFC4180
[1] CSV format. Although unreliable IMO, this RFC is used as an exchange
format by several Dutch government agencies.

The RFC4180 spec supports multi-line fields (i.e. fields containing line
breaks) and escaping of double quotes and delimiters within fields. Because
of the multi-line feature, one cannot directly use the
FileInputFormat/TextInputFormat or LineRecordReader implementations.
Furthermore, as I see it, input splitting must be disabled (I am not sure
whether any efficient splitting strategy is possible at all).
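
To illustrate why a line-oriented reader is not enough, here is a minimal
sketch of reading one RFC4180 record at a time from a character stream. It
is only an illustration under stated assumptions, not a proposed Crunch or
Hadoop API: the class name Rfc4180Sketch and the method readRecord are
hypothetical, the delimiter is fixed to a comma, and only the JDK is used.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only.
class Rfc4180Sketch {

    /**
     * Reads the next record and returns its fields, or null at end of input.
     * Line breaks inside a quoted field are kept as part of the field, and a
     * doubled quote ("") inside a quoted field yields one literal quote.
     */
    public static List<String> readRecord(PushbackReader in) throws IOException {
        int c = in.read();
        if (c == -1) {
            return null; // no more records
        }
        in.unread(c);

        List<String> fields = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        boolean inQuotes = false;

        while (true) {
            c = in.read();
            if (inQuotes) {
                if (c == -1) {
                    throw new IOException("Unterminated quoted field");
                }
                if (c == '"') {
                    int next = in.read();
                    if (next == '"') {
                        field.append('"'); // escaped quote
                    } else {
                        inQuotes = false;  // closing quote
                        if (next != -1) {
                            in.unread(next);
                        }
                    }
                } else {
                    field.append((char) c); // includes CR and LF
                }
            } else if (c == '"' && field.length() == 0) {
                inQuotes = true; // opening quote at start of field
            } else if (c == ',') {
                fields.add(field.toString());
                field.setLength(0);
            } else if (c == '\r' || c == '\n' || c == -1) {
                if (c == '\r') {
                    int next = in.read(); // swallow the LF of a CRLF pair
                    if (next != '\n' && next != -1) {
                        in.unread(next);
                    }
                }
                fields.add(field.toString());
                return fields; // end of record
            } else {
                field.append((char) c);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        String data = "id,comment\r\n"
                + "1,\"first line\r\nsecond line\"\r\n"
                + "2,\"say \"\"hi\"\", please\"\r\n";
        PushbackReader in = new PushbackReader(new StringReader(data));
        List<String> record;
        while ((record = readRecord(in)) != null) {
            System.out.println(record.size() + " fields: " + record);
        }
    }
}
```

The point of the sketch is that the record boundary depends on the quoting
state, which is only known by scanning from the start of the file. In a
Hadoop job, a reader along these lines would presumably sit behind a
FileInputFormat whose isSplitable returns false, since a split boundary
could otherwise fall inside a quoted field.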

There are several Java libraries that provide some RFC4180 support [3]. For
Pig, a slightly modified CSVExcelStorage UDF [2] seems to do the job (I am
not sure about the input splitting though). The "Hadoop in Practice"
example [4], however, does not support multi-line fields.

Has anyone used similar 'multi-line field' formats? I wonder how common
this use case is.

Also, should we provide support for it in Crunch?

Cheers,
Chris

[1]  RFC 4180 - http://tools.ietf.org/html/rfc4180
[2]  Pig CSVExcelStorage UDF -
http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/CSVExcelStorage.java
[3]  jCSV, OpenCSV, SuperCSV
[4]  "Hadoop in Practice" CSVInputFormat example -
https://github.com/alexholmes/hadoop-book/blob/master/src/main/java/com/manning/hip/ch3/csv/CSVInputFormat.java
