crunch-dev mailing list archives

From Matthias Friedrich <m...@mafr.de>
Subject Re: RFC 4180 compliant CSV format
Date Mon, 18 Mar 2013 17:07:48 GMT
On Monday, 2013-03-18, Josh Wills wrote:
> I personally try to steer people away from multi-line input formats b/c of
> how tedious they are to write/maintain.

Same here.

> To me, the question of supporting
> CSVs maps to a more general question about whether we should support some
> kind of named Record/Row type for processing data from
> CSV/Hive/Avro/PB/Thrift/etc. in a generic way. I could make arguments
> either way, which I'm happy to do if folks are interested, but I'd rather
> hear from other people first, esp. if anyone feels strongly about it.

I have used something like it in aggregation and machine learning
systems and I've grown quite fond of it. It is basically a HashMap that
is partially immutable: once you add a value, you can't change it
anymore. You can structure your system as a sequence of rules that
each add fields to the record. This is quite flexible, because you can
work with changing schemas and different sets of rules easily.
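
A minimal sketch of such a record in Java, to make the idea concrete. The
names WriteOnceRecord, RecordDemo and applyRules, and the example fields, are
made up for illustration; none of this is an existing Crunch API, and a real
implementation would probably want typed accessors.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

/** Hypothetical write-once record: fields can be added but never overwritten. */
final class WriteOnceRecord {
    private final Map<String, Object> fields = new LinkedHashMap<>();

    /** Adds a field; throws if a field with this name was already set. */
    WriteOnceRecord put(String name, Object value) {
        if (fields.containsKey(name)) {
            throw new IllegalStateException("field already set: " + name);
        }
        fields.put(name, value);
        return this;
    }

    Object get(String name) {
        return fields.get(name);
    }

    boolean has(String name) {
        return fields.containsKey(name);
    }
}

final class RecordDemo {
    /** Applies the rules in order; each rule reads existing fields and adds new ones. */
    static WriteOnceRecord applyRules(WriteOnceRecord record,
                                      List<Consumer<WriteOnceRecord>> rules) {
        for (Consumer<WriteOnceRecord> rule : rules) {
            rule.accept(record);
        }
        return record;
    }

    public static void main(String[] args) {
        // Illustrative input fields (assumed, not from any real schema).
        WriteOnceRecord record = new WriteOnceRecord()
                .put("price", 10.0)
                .put("quantity", 3);

        List<Consumer<WriteOnceRecord>> rules = List.of(
                // Rule 1 derives "total" from the two input fields.
                rec -> rec.put("total",
                        (Double) rec.get("price") * (Integer) rec.get("quantity")),
                // Rule 2 builds on the field that rule 1 added.
                rec -> rec.put("large", (Double) rec.get("total") > 25.0));

        applyRules(record, rules);
        System.out.println(record.get("total") + " " + record.get("large"));
    }
}
```

Because a field can never be overwritten, a later rule can rely on whatever an
earlier rule added, and a second write to the same field fails loudly instead
of silently changing the data.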

Regards,
  Matthias

> 
> On Mon, Mar 18, 2013 at 3:14 AM, Christian Tzolov <
> christian.tzolov@gmail.com> wrote:
> 
> > Hi,
> >
> > I am working on ETL projects that consume and produce data in the RFC4180
> > [1] CSV format. Although unreliable IMO, this RFC is used as an exchange
> > format by several Dutch government agencies.
> >
> > The RFC4180 spec supports multi-line fields (e.g. fields with line
> > breaks) and escaping of double quotes and delimiters within fields. Because
> > of the multi-line feature one can't directly use the
> > FileInputFormat/TextInputFormat or LineRecordReader implementations.
> > Furthermore, as I see it, input splitting must be disabled (not sure if
> > any efficient splitting strategy is possible at all).
> >
> > There are several Java libraries that provide some RFC4180 support [3]. For
> > Pig a slightly modified CSVExcelStorage UDF [2] seems to do the job (not
> > sure about the input splitting though). Also the "Hadoop in Practice"
> > example [4] does not support the multi-line fields.
> >
> > Has anyone used similar 'multi-line fields' formats? I wonder how common
> > this use case is.
> >
> > Also shall we provide support for it in Crunch?
> >
> > Cheers,
> > Chris
> >
> > [1]  RFC 4180 - http://tools.ietf.org/html/rfc4180
> > [2]  Pig CSVExcelStorage UDF -
> >
> > http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/CSVExcelStorage.java
> > [3]  jCSV, OpenCSV, SuperCSV
> > [4]
> >
> > https://github.com/alexholmes/hadoop-book/blob/master/src/main/java/com/manning/hip/ch3/csv/CSVInputFormat.java
> >
> 
> 
> 
> -- 
> Director of Data Science
> Cloudera <http://www.cloudera.com>
> Twitter: @josh_wills <http://twitter.com/josh_wills>
