crunch-dev mailing list archives

From Josh Wills <jwi...@cloudera.com>
Subject Re: RFC 4180 compliant CSV format
Date Mon, 18 Mar 2013 19:52:35 GMT
On Mon, Mar 18, 2013 at 10:07 AM, Matthias Friedrich <matt@mafr.de> wrote:

> On Monday, 2013-03-18, Josh Wills wrote:
> > I personally try to steer people away from multi-line input formats b/c of
> > how tedious they are to write/maintain.
>
> Same here.
>
> > To me, the question of supporting
> > CSVs maps to a more general question about whether we should support some
> > kind of named Record/Row type for processing data from
> > CSV/Hive/Avro/PB/Thrift/etc. in a generic way. I could make arguments
> > either way, which I'm happy to do if folks are interested, but I'd rather
> > hear from other people first, esp. if anyone feels strongly about it.
>
> I have used something like it in aggregation and machine learning
> systems and I've grown quite fond of it. It is basically a HashMap that
> is partially immutable - once you add a value you can't change it
> anymore. You can structure your system as a sequence of rules that
> each adds fields to the record. This is quite flexible, you can work
> with changing schemas and different sets of rules easily.
>

I've been noodling on such a system for some ML tools I'm writing on top of
Crunch. I'll be happy to import the code (or whatever pieces of it seem
generally useful) if there's interest. I'm not quite ready to release it,
but I'll ping the dev list when it's published.
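
To be clear, the following is not the code mentioned above, just a minimal
illustrative sketch of the write-once record Matthias describes. The names
WriteOnceRecord and Rule are hypothetical and are not part of Crunch:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical write-once record: fields can be added but never overwritten. */
final class WriteOnceRecord {
    private final Map<String, Object> fields = new HashMap<>();

    /** Adds a field; throws if a field with this name was already set. */
    WriteOnceRecord put(String name, Object value) {
        if (fields.containsKey(name)) {
            throw new IllegalStateException("field already set: " + name);
        }
        fields.put(name, value);
        return this;
    }

    /** Returns the value bound to the name, or null if it was never set. */
    Object get(String name) {
        return fields.get(name);
    }

    boolean has(String name) {
        return fields.containsKey(name);
    }

    /** Read-only view of all fields added so far. */
    Map<String, Object> asMap() {
        return Collections.unmodifiableMap(fields);
    }
}

/** Hypothetical rule: reads existing fields and adds new ones to the record. */
interface Rule {
    void apply(WriteOnceRecord record);
}
```

A pipeline would then be a list of Rule instances applied in order, each one
adding its own fields and failing fast if it tries to overwrite an earlier
one.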


>
> Regards,
>   Matthias
>
> >
> > On Mon, Mar 18, 2013 at 3:14 AM, Christian Tzolov <
> > christian.tzolov@gmail.com> wrote:
> >
> > > Hi,
> > >
> > > I am working on ETL projects that consume and produce data in the RFC4180
> > > [1] CSV format. Although unreliable IMO, this RFC is used as an exchange
> > > format by several Dutch government agencies.
> > >
> > > The RFC4180 spec supports multi-line fields (e.g. fields with line
> > > breaks) and escaping of double quotes and delimiters within fields. Because
> > > of the multi-line feature one can't directly use the
> > > FileInputFormat/TextInputFormat or LineRecordReader implementations.
> > > Furthermore, as I see it, the input splitting must be disabled (not sure if
> > > any efficient splitting strategy is possible at all).
> > >
> > > There are several Java libraries that provide some RFC4180 support [3]. For
> > > Pig a slightly modified CSVExcelStorage UDF [2] seems to do the job (not
> > > sure about the input splitting though). Also the "Hadoop in Practice"
> > > example [4] does not support the multi-line fields.
> > >
> > > Has anyone used similar 'multi-line fields' formats? I wonder how common
> > > this use case is.
> > >
> > > Also, should we provide support for it in Crunch?
> > >
> > > Cheers,
> > > Chris
> > >
> > > [1]  RFC 4180 - http://tools.ietf.org/html/rfc4180
> > > [2]  Pig CSVExcelStorage UDF -
> > > http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/CSVExcelStorage.java
> > > [3]  jCSV, OpenCSV, SuperCSV
> > > [4]
> > > https://github.com/alexholmes/hadoop-book/blob/master/src/main/java/com/manning/hip/ch3/csv/CSVInputFormat.java
> > >
> >
> >
> >
> > --
> > Director of Data Science
> > Cloudera <http://www.cloudera.com>
> > Twitter: @josh_wills <http://twitter.com/josh_wills>
>
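
On the multi-line point in Christian's original message: a minimal sketch of
why a line-based reader is not enough for RFC 4180 input. A record only ends
at a line break that falls outside double quotes, so the parser has to track
quote state across lines. The class name Rfc4180Sketch is hypothetical, the
delimiter is assumed to be a comma, and this is not a full implementation of
the RFC (no header handling or validation):

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: splits CSV text into records, honoring quoted line breaks. */
final class Rfc4180Sketch {

    static List<List<String>> parse(String text) {
        List<List<String>> records = new ArrayList<>();
        List<String> record = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        boolean inQuotes = false;

        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (inQuotes) {
                if (c != '"') {
                    // Inside quotes, commas and line breaks are ordinary data.
                    field.append(c);
                } else if (i + 1 < text.length() && text.charAt(i + 1) == '"') {
                    // A doubled quote is an escaped literal quote.
                    field.append('"');
                    i++;
                } else {
                    inQuotes = false;
                }
            } else if (c == '"') {
                inQuotes = true;
            } else if (c == ',') {
                record.add(field.toString());
                field.setLength(0);
            } else if (c == '\n') {
                // Unquoted line break: end of the current record.
                record.add(field.toString());
                field.setLength(0);
                records.add(record);
                record = new ArrayList<>();
            } else if (c != '\r') {
                // Unquoted carriage returns are dropped (CRLF line endings).
                field.append(c);
            }
        }
        // Flush a final record that has no trailing line break.
        if (field.length() > 0 || !record.isEmpty()) {
            record.add(field.toString());
            records.add(record);
        }
        return records;
    }
}
```

Because the quote state at any byte offset depends on everything before it,
a reader that starts in the middle of a file cannot tell whether a line break
is inside a field, which is the splitting problem raised above.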



-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
