crunch-dev mailing list archives

From Josh Wills <jwi...@cloudera.com>
Subject Re: CSVFileSource weirdness
Date Tue, 24 Jun 2014 22:28:39 GMT
Well, the test itself looks a little odd to me: why are you calling
pipeline.run() right after pipeline.read(new CSVFileSource(...))? There's
nothing for the pipeline to do at that point.

J


On Tue, Jun 24, 2014 at 10:53 AM, Champion,Mac <Mac.Champion@cerner.com>
wrote:

> Now that the CSVFileSource is in Crunch 0.8.3, I’ve been trying to
> integrate it into the project that originally spurred its creation.
> However, I’m running into some weird issues.
>
> Reading a new CSVFileSource and then directly materializing and using it
> works fine; that scenario is already covered in the CSVFileSourceIT:
>
> https://github.com/apache/crunch/blob/apache-crunch-0.8.3/crunch-core/src/it/java/org/apache/crunch/io/text/csv/CSVFileSourceIT.java#L41
>
> But as soon as I try to do something extra with that PCollection (say,
> use count() to turn it into a PTable, grab its key set, then print it out),
> everything falls apart.
>
> New test:
>
> https://github.com/champgm/crunch/blob/master/crunch-core/src/it/java/org/apache/crunch/io/text/csv/CSVFileSourceIT.java#L56
>
> Result:
> http://pastebin.com/f7iUQ73N
>
> It seems that, when some additional actions are added to the pipeline, a
> CSVRecordReader is being created in CrunchRecordReader without going
> through the CSVFileSource or CSVInputFormat flow, where its various parsing
> options are normally configured.
>
> I was able to fix this issue by copying the “configure” method from
> CSVInputFormat and adding it to the beginning of the “initialize” method of
> the CSVRecordReader, which forces it to check the job config and configure
> itself if some options are null, but I don’t really feel like this is
> ideal. Did I miss something when I was designing this set of classes? Is
> this behavior expected?
>
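
For reference, the workaround described in the quoted message (having the
record reader fall back to the job configuration when its parsing options
are still null) can be sketched in isolation as below. This is only a
minimal, self-contained illustration and not the actual Crunch code:
JobConfig, SketchCsvRecordReader, and both configuration key names are
hypothetical stand-ins for Hadoop's Configuration, the CSVRecordReader, and
whatever keys CSVInputFormat really uses.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical stand-in for a Hadoop job Configuration (string keys and values).
final class JobConfig {
    private final Map<String, String> values = new HashMap<>();

    void set(String key, String value) {
        values.put(key, value);
    }

    String get(String key, String defaultValue) {
        return values.getOrDefault(key, defaultValue);
    }
}

// Hypothetical reader whose options may be unset when it is created directly,
// i.e. without going through the source / input-format path that sets them.
final class SketchCsvRecordReader {
    // Assumed key names, for illustration only.
    static final String DELIMITER_KEY = "sketch.csv.delimiter";
    static final String BUFFER_SIZE_KEY = "sketch.csv.buffersize";

    private Character delimiter; // null until configured
    private Integer bufferSize;  // null until configured

    // Path taken when nothing configures the reader up front.
    SketchCsvRecordReader() {
    }

    // Path taken when the options are passed in explicitly.
    SketchCsvRecordReader(char delimiter, int bufferSize) {
        this.delimiter = delimiter;
        this.bufferSize = bufferSize;
    }

    // The workaround: fill in any option that is still null from the job
    // config before doing anything else. Options already set are left alone.
    void initialize(JobConfig conf) {
        if (delimiter == null) {
            delimiter = conf.get(DELIMITER_KEY, ",").charAt(0);
        }
        if (bufferSize == null) {
            bufferSize = Integer.parseInt(conf.get(BUFFER_SIZE_KEY, "65536"));
        }
    }

    int bufferSize() {
        if (bufferSize == null) {
            throw new IllegalStateException("initialize() has not been called");
        }
        return bufferSize;
    }

    // Splits one line on the configured delimiter (no quote handling here).
    String[] split(String line) {
        if (delimiter == null) {
            throw new IllegalStateException("initialize() has not been called");
        }
        return line.split(Pattern.quote(delimiter.toString()), -1);
    }
}
```

In this sketch a reader built with the no-argument constructor picks up its
delimiter from the config during initialize(), while a reader built with
explicit options keeps them. Whether that fallback belongs in the reader at
all is the open question from the quoted message.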



-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
