crunch-dev mailing list archives

From Josh Wills <jwi...@cloudera.com>
Subject Re: I would like to contribute a new FileSource
Date Fri, 28 Feb 2014 22:50:51 GMT
Hey Mac,

FWIW, I would be very happy to have it in the project and would be glad for
the contribution.

J


On Fri, Feb 28, 2014 at 10:04 AM, Champion,Mac <Mac.Champion@cerner.com> wrote:

> Hi all,
>
> Late last year, my team decided to change the way we processed large CSV
> files. Up until then, we had been parsing them locally, sending the Avro
> data to HDFS, and processing it from there with Crunch. This was
> irritating and limiting in a few ways, so we decided to stop processing
> locally and do the parsing/loading entirely in Crunch. Everything went well
> using TextFileSource until we ran into one of our files, which contained a
> CSV record with multiple lines in one field.
>
> Here's a possible example of a record spanning multiple lines:
>
>
>  "Champion, Mac","1234 Hoth St.
>         Apartment 101
>         Atlanta, GA
>         64086","30","M","5/28/2010 12:00:00 AM","Just some guy"
>
> To deal with this, I wrote a CSVInputFormat and CSVRecordReader that can
> intelligently split and parse CSV files while maintaining the integrity of
> each record. This works great, but using it is a little messy.
>
>  We have to read from the files like this:
>
>
>  final PTable<Long, String> csvFile = pipeline.read(
>      disableFileCombine(From.formattedFile(outputPath,
>          CSVInputFormat.class, Writables.longs(), Writables.strings())));
>
>
>  What I propose is that we extend FileSourceImpl in a way similar to
> NLineFileSource and/or TextFileSource and submit the extension and its CSV
> parsing logic as a patch to Crunch. Is this a valid idea for a new JIRA?
> Would other users of Crunch find this ability to reliably parse out CSV
> Records valuable? If so, I would like to log a JIRA and begin working on
> it in the very near future.
>



-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
