crunch-dev mailing list archives

From "Champion,Mac" <Mac.Champ...@Cerner.com>
Subject CSVFileSource weirdness
Date Tue, 24 Jun 2014 17:53:10 GMT
Now that the CSVFileSource is in crunch 0.8.3, I’ve been trying to integrate it into the
project that originally spurred its creation. However, I’m running into some weird issues.

Reading a new CSVFileSource and directly materializing and using it works fine; that scenario
is already covered in CSVFileSourceIT:
https://github.com/apache/crunch/blob/apache-crunch-0.8.3/crunch-core/src/it/java/org/apache/crunch/io/text/csv/CSVFileSourceIT.java#L41

But as soon as I try to do something extra with that PCollection (say, use count() to turn
it into a PTable, grab its key set, then print it out), everything falls apart.
New test:
https://github.com/champgm/crunch/blob/master/crunch-core/src/it/java/org/apache/crunch/io/text/csv/CSVFileSourceIT.java#L56

Result:
http://pastebin.com/f7iUQ73N

It seems that, when some additional actions are added to the pipeline, a CSVRecordReader is
being created in CrunchRecordReader without going through the CSVFileSource or CSVInputFormat
flow, where its various parsing options are normally configured.

I was able to fix this issue by copying the “configure” method from CSVInputFormat and adding
it to the beginning of the “initialize” method of the CSVRecordReader, which forces the reader
to check the job config and configure itself if any of its options are null. I don’t really
feel like this is ideal, though. Did I miss something when I was designing this set of classes?
Is this behavior expected?
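
The workaround described above can be sketched roughly as follows. This is a simplified,
self-contained stand-in and not the actual Crunch code: the class name SketchCsvRecordReader,
the config keys, the defaults, and the use of a plain Map in place of the Hadoop job
configuration are all assumptions made for illustration. It only shows the pattern of a
reader filling in unset options from the job config at the start of initialize.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-in for a CSV record reader. NOT the real Crunch class.
class SketchCsvRecordReader {
    // Hypothetical config keys, invented for this sketch.
    static final String QUOTE_KEY = "sketch.csv.quote";
    static final String BUFFER_SIZE_KEY = "sketch.csv.buffersize";

    // Hypothetical defaults, used when the job config has no value either.
    static final char DEFAULT_QUOTE = '"';
    static final int DEFAULT_BUFFER_SIZE = 64 * 1024;

    // Null means "not configured yet".
    private Character quote;
    private Integer bufferSize;

    // Path that bypasses the input format: nothing is configured up front.
    SketchCsvRecordReader() { }

    // Path where the input format hands over fully configured options.
    SketchCsvRecordReader(char quote, int bufferSize) {
        this.quote = quote;
        this.bufferSize = bufferSize;
    }

    // A plain Map stands in for the job configuration (an assumption of this sketch).
    void initialize(Map<String, String> jobConf) {
        configureFromJobConf(jobConf);
        // ... opening the split and setting up parsing would follow here ...
    }

    // Fills in only the options that are still null; preset values are left alone.
    private void configureFromJobConf(Map<String, String> jobConf) {
        if (quote == null) {
            String value = jobConf.get(QUOTE_KEY);
            quote = (value == null || value.isEmpty()) ? DEFAULT_QUOTE : value.charAt(0);
        }
        if (bufferSize == null) {
            String value = jobConf.get(BUFFER_SIZE_KEY);
            bufferSize = (value == null) ? DEFAULT_BUFFER_SIZE : Integer.parseInt(value);
        }
    }

    char getQuote() { return quote; }

    int getBufferSize() { return bufferSize; }
}

class SketchDemo {
    public static void main(String[] args) {
        Map<String, String> jobConf = new HashMap<>();
        jobConf.put(SketchCsvRecordReader.QUOTE_KEY, "'");
        jobConf.put(SketchCsvRecordReader.BUFFER_SIZE_KEY, "1024");

        // Reader created without any options, as in the failing scenario.
        SketchCsvRecordReader reader = new SketchCsvRecordReader();
        reader.initialize(jobConf);
        System.out.println(reader.getQuote() + " " + reader.getBufferSize());
    }
}
```

The null checks keep a reader that was already configured by its input format unchanged,
so the fallback only matters when the reader is constructed some other way.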
