crunch-dev mailing list archives

From Harsh J <ha...@cloudera.com>
Subject Re: RFC 4180 compliant CSV format
Date Mon, 18 Mar 2013 23:44:16 GMT
Does OpenCSV (http://opencsv.sourceforge.net/#what-features) support
your format? There's a Hive wrapper for it:
http://ogrodnek.github.com/csv-serde and IIRC also a newer InputFormat
at https://github.com/mvallebr/CSVInputFormat (via
https://issues.apache.org/jira/browse/MAPREDUCE-2208).

On Mon, Mar 18, 2013 at 3:44 PM, Christian Tzolov
<christian.tzolov@gmail.com> wrote:
> Hi,
>
> I am working on ETL projects that consume and produce data in the RFC4180
> [1] CSV format. Although unreliable IMO, this RFC is used as an exchange
> format by several Dutch government agencies.
>
> The RFC4180 spec supports multi-line fields (e.g. fields with line
> breaks) and escaping of double quotes and delimiters within fields. Because
> of the multi-line feature one can't directly use the
> FileInputFormat/TextInputFormat or LineRecordReader implementations.
> Furthermore as I see it the input splitting must be disabled (not sure if
> any efficient splitting strategy is possible at all).
>
> There are several java libraries that provide some RFC4180 support [3]. For
> Pig a slightly modified CSVExcelStorage UDF [2] seems to do the job (not
> sure about the input splitting though). Also the "Hadoop in Practice"
> example [4] does not support the multi-line fields.
>
> Has anyone used similar 'multi-line field' formats? I wonder how common
> this use case is.
>
> Also shall we provide support for it in Crunch?
>
> Cheers,
> Chris
>
> [1]  RFC 4180 - http://tools.ietf.org/html/rfc4180
> [2]  Pig CSVExcelStorage UDF -
> http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/CSVExcelStorage.java
> [3]  jCSV, OpenCSV, SuperCSV
> [4]
> https://github.com/alexholmes/hadoop-book/blob/master/src/main/java/com/manning/hip/ch3/csv/CSVInputFormat.java
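
To make the multi-line point in the quoted message concrete, here is a minimal sketch (plain Python, not Crunch or Hadoop code) of why a line-oriented reader miscounts RFC 4180 records. The function name `parse_rfc4180` and the sample data are made up for illustration, and the parser only covers the rules mentioned above: quoted fields, doubled double quotes, and line breaks inside quotes.

```python
def parse_rfc4180(text, delimiter=","):
    """Split text into records, treating quoted line breaks as field data."""
    records, record, field = [], [], []
    in_quotes = False
    started = False  # True once the current record has consumed any character
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if in_quotes:
            if ch == '"':
                if i + 1 < n and text[i + 1] == '"':
                    field.append('"')  # doubled quote is an escaped quote
                    i += 1
                else:
                    in_quotes = False  # closing quote of the field
            else:
                field.append(ch)  # delimiters and line breaks are literal here
        elif ch == '"':
            in_quotes = True
            started = True
        elif ch == delimiter:
            record.append("".join(field))
            field = []
            started = True
        elif ch == "\n" or (ch == "\r" and i + 1 < n and text[i + 1] == "\n"):
            if ch == "\r":
                i += 1  # consume the LF of a CRLF pair
            record.append("".join(field))
            records.append(record)
            record, field, started = [], [], False
        else:
            field.append(ch)
            started = True
        i += 1
    if started:  # last record without a trailing line break
        record.append("".join(field))
        records.append(record)
    return records


# Hypothetical sample: a header plus two data records, one spanning two lines.
sample = 'id,comment\r\n1,"line one\r\nline two"\r\n2,"say ""hi"", ok"\r\n'

physical_lines = sample.splitlines()    # what a line-based reader would see
logical_records = parse_rfc4180(sample)  # what RFC 4180 actually describes
```

Here `physical_lines` has one more entry than `logical_records`, because the quoted line break is counted as a record boundary. The same ambiguity is what makes input splitting hard: a reader that starts at an arbitrary byte offset cannot tell from a line break alone whether it is inside a quoted field.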



--
Harsh J
