crunch-dev mailing list archives

From "Champion,Mac" <Mac.Champ...@Cerner.com>
Subject I would like to contribute a new FileSource
Date Fri, 28 Feb 2014 18:04:03 GMT
Hi all,

Late last year, my team decided to change the way we process large CSV
files. Until then, we had been parsing them locally, sending the resulting
Avro data to HDFS, and processing it from there with Crunch. This was
irritating and limiting in a few ways, so we decided to stop processing
locally and do the parsing/loading entirely in Crunch. Everything went well
using TextFileSource until we ran into a file containing a CSV record with
a single field that spans multiple lines.

Here's an example of a record spanning multiple lines:

 "Champion, Mac","1234 Hoth St.
	Apartment 101
	Atlanta, GA
	64086","30","M","5/28/2010 12:00:00 AM","Just some guy"
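
To make the problem concrete, here is a minimal sketch of how a reader can
tell a real record boundary from a newline inside a quoted field. It is not
the CSVRecordReader discussed in this message; the class and method names
are hypothetical, it assumes double-quote delimiters with "" as the escaped
quote, and it ignores the harder question of records that straddle Hadoop
input split boundaries.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Crunch.
final class QuotedRecordSplitter {

    // Splits raw CSV text into records, treating newlines that occur
    // inside double-quoted fields as data rather than record separators.
    static List<String> splitRecords(String text) {
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;

        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '"') {
                // Each quote toggles the state. An escaped quote ("")
                // toggles twice, so the state ends up unchanged.
                inQuotes = !inQuotes;
                current.append(c);
            } else if (c == '\n' && !inQuotes) {
                // Unquoted newline: the record is complete.
                int len = current.length();
                if (len > 0 && current.charAt(len - 1) == '\r') {
                    current.setLength(len - 1); // drop the CR of a CRLF ending
                }
                records.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        // Keep a final record that has no trailing newline.
        if (current.length() > 0) {
            records.add(current.toString());
        }
        return records;
    }
}
```

A line-oriented reader such as the one behind TextFileSource would hand the
example above to the pipeline as four separate lines, while a quote-aware
scan like this one keeps it together as a single record.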

To deal with this, I wrote a CSVInputFormat and CSVRecordReader that can
intelligently split and parse CSV files while maintaining the integrity of
each record. This works great, but using it is a little messy.

We have to read from the files like this:

  final PTable<Long, String> csvFile = pipeline.read(
      disableFileCombine(
          From.formattedFile(outputPath, CSVInputFormat.class,
              Writables.longs(), Writables.strings())));

 
What I propose is that we extend FileSourceImpl in a way similar to
NLineFileSource and/or TextFileSource and submit the extension and its CSV
parsing logic as a patch to Crunch. Is this a valid idea for a new JIRA?
Would other users of Crunch find the ability to reliably parse CSV records
valuable? If so, I would like to log a JIRA and begin working on it in the
very near future.

