crunch-dev mailing list archives

From Josh Wills <jwi...@cloudera.com>
Subject Re: RFC 4180 compliant CSV format
Date Wed, 20 Mar 2013 16:37:47 GMT
Inlined.

On Tue, Mar 19, 2013 at 6:54 AM, Christian Tzolov <
christian.tzolov@gmail.com> wrote:

> @Josh, most of the time I can manage to steer away from multiline records
> but with gov. organisations it is difficult to alter what they
> have considered as a 'standard'.


> Can you please elaborate on your idea for named records/rows?
>

Yeah, I posted a library of Crunch-based tools for machine learning that
I've been working on for the past couple of months:

https://github.com/cloudera/ml

The core module defines a Record interface that should eventually support
working with Avro records, HCatalog records, CSV files, and even Vectors:
anything that can be made to look and feel like a typed tuple of values.
The parallel module defines the associated PTypes for the various
implementations. The APIs don't yet have the sophistication that Matthias
mentioned (in terms of evolving immutable objects), but that is the
direction I expect to go in.
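
To make the "typed tuple of values" idea concrete, here is a minimal sketch
of what such an abstraction could look like over a parsed CSV row. It is not
the API of the cloudera/ml library: the names RecordSketch, Record, CsvRecord
and getAsDouble are hypothetical and chosen only for illustration.

```java
import java.util.List;

/** Illustrative only; all names here are hypothetical, not the cloudera/ml API. */
final class RecordSketch {

  /** A row viewed as a tuple of values, addressable by position or by name. */
  interface Record {
    int size();
    Object get(int index);
    Object get(String name);
    double getAsDouble(int index);
  }

  /** Record backed by the string fields of one CSV row plus a header. */
  static final class CsvRecord implements Record {
    private final List<String> names;
    private final List<String> values;

    CsvRecord(List<String> names, List<String> values) {
      if (names.size() != values.size()) {
        throw new IllegalArgumentException("Header and row differ in length");
      }
      this.names = names;
      this.values = values;
    }

    @Override
    public int size() {
      return values.size();
    }

    @Override
    public Object get(int index) {
      return values.get(index);
    }

    @Override
    public Object get(String name) {
      int index = names.indexOf(name);
      if (index < 0) {
        throw new IllegalArgumentException("No field named " + name);
      }
      return values.get(index);
    }

    @Override
    public double getAsDouble(int index) {
      // CSV carries no type information, so conversion happens on access.
      return Double.parseDouble(values.get(index));
    }
  }
}
```

An Avro- or HCatalog-backed implementation would presumably implement the
same interface over its own storage, which is what lets downstream code stay
independent of the input format.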

J


> @Harsh, thanks for the references. I remember I had some issues with
> OpenCSV (either the iterator support or some RFC4180 limitations). But I
> will check the other sources.
>
> Thanks,
> Chris
>
>
>
> On Tue, Mar 19, 2013 at 12:44 AM, Harsh J <harsh@cloudera.com> wrote:
>
> > Does OpenCSV (http://opencsv.sourceforge.net/#what-features) support
> > your format? There's a Hive wrapper for it:
> > http://ogrodnek.github.com/csv-serde and IIRC also a newer InputFormat
> > at https://github.com/mvallebr/CSVInputFormat (via
> > https://issues.apache.org/jira/browse/MAPREDUCE-2208).
> >
> > On Mon, Mar 18, 2013 at 3:44 PM, Christian Tzolov
> > <christian.tzolov@gmail.com> wrote:
> > > Hi,
> > >
> > > I am working on ETL projects that consume and produce data in the
> > > RFC4180 [1] CSV format. Although unreliable IMO, this RFC is used as an
> > > exchange format by several Dutch government agencies.
> > >
> > > The RFC4180 spec supports multi-line fields (i.e. fields with line
> > > breaks) and escaping of double quotes and delimiters within fields.
> > > Because of the multi-line feature one can't directly use the
> > > FileInputFormat/TextInputFormat or LineRecordReader implementations.
> > > Furthermore, as I see it, the input splitting must be disabled (not sure
> > > if any efficient splitting strategy is possible at all).
> > >
> > > There are several Java libraries that provide some RFC4180 support [3].
> > > For Pig, a slightly modified CSVExcelStorage UDF [2] seems to do the job
> > > (not sure about the input splitting though). Also, the "Hadoop in
> > > Practice" example [4] does not support multi-line fields.
> > >
> > > Has anyone used similar 'multi-line fields' formats? I wonder how
> > > common this use case is.
> > >
> > > Also, shall we provide support for it in Crunch?
> > >
> > > Cheers,
> > > Chris
> > >
> > > [1]  RFC 4180 - http://tools.ietf.org/html/rfc4180
> > > [2]  Pig CSVExcelStorage UDF -
> > > http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/CSVExcelStorage.java
> > > [3]  jCSV, OpenCSV, SuperCSV
> > > [4]
> > > https://github.com/alexholmes/hadoop-book/blob/master/src/main/java/com/manning/hip/ch3/csv/CSVInputFormat.java
> >
> >
> >
> > --
> > Harsh J
> >
>
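
The multi-line field problem raised in the original message can be shown
with a small parser. The following is a minimal sketch, assuming the whole
input fits in a String; Rfc4180Records and parse are hypothetical names, not
part of Crunch, Hadoop, or any of the libraries mentioned in the thread. It
tracks whether the current position is inside a quoted field, which is
exactly the state a line-based reader does not have.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative RFC 4180 style parser; a hypothetical helper, not a library API. */
final class Rfc4180Records {

  /** Splits CSV text into records, honouring quoted fields and "" escapes. */
  static List<List<String>> parse(String text) {
    List<List<String>> records = new ArrayList<>();
    List<String> fields = new ArrayList<>();
    StringBuilder field = new StringBuilder();
    boolean inQuotes = false;
    boolean pending = false; // true once the current record has any content

    for (int i = 0; i < text.length(); i++) {
      char c = text.charAt(i);
      boolean hasNext = i + 1 < text.length();
      if (inQuotes) {
        if (c != '"') {
          field.append(c); // commas and line breaks are literal here
        } else if (hasNext && text.charAt(i + 1) == '"') {
          field.append('"'); // doubled quote is an escaped quote
          i++;
        } else {
          inQuotes = false; // closing quote
        }
      } else if (c == '"') {
        inQuotes = true;
        pending = true;
      } else if (c == ',') {
        fields.add(field.toString());
        field.setLength(0);
        pending = true;
      } else if (c == '\n' || (c == '\r' && hasNext && text.charAt(i + 1) == '\n')) {
        if (c == '\r') {
          i++; // consume the LF of a CRLF pair
        }
        fields.add(field.toString());
        field.setLength(0);
        records.add(fields);
        fields = new ArrayList<>();
        pending = false;
      } else {
        field.append(c);
        pending = true;
      }
    }
    if (pending) { // last record without a trailing line break
      fields.add(field.toString());
      records.add(fields);
    }
    return records;
  }
}
```

Given two records where the first contains a quoted field with an embedded
CRLF, parse returns two records, while splitting the same text on line
breaks yields three lines. This is also why a split boundary cannot simply
be moved to the next line break: without scanning from a known record
start, a reader cannot tell whether that line break is inside a quoted
field.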



-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
