crunch-dev mailing list archives

From Josh Wills <>
Subject Re: CSVFileSource weirdness
Date Tue, 24 Jun 2014 22:28:39 GMT
Well, the test itself looks a little odd to me: why are you calling right after CSVFileSource(...))? There's
nothing for the pipeline to do at that point.


On Tue, Jun 24, 2014 at 10:53 AM, Champion,Mac <>

> Now that the CSVFileSource is in crunch 0.8.3, I’ve been trying to
> integrate it into the project that originally spurred its creation.
> However, I’m running into some weird issues.
> Reading and directly materializing a new CSVFileSource works fine; that
> scenario is already covered in the CSVFileSourceIT.
> But as soon as I try to do something extra with that PCollection (say,
> use count() to turn it into a PTable, grab its key set, then print it out),
> everything falls apart.
> New Test:
> Result:
> It seems that, when some additional actions are added to the pipeline, a
> CSVRecordReader is being created in CrunchRecordReader without going
> through the CSVFileSource or CSVInputFormat flow, where its various parsing
> options are normally configured.
> I was able to fix this issue by copying the "configure" method from
> CSVInputFormat and adding it to the beginning of the "initialize" method of
> the CSVRecordReader, which forces it to check the job config and configure
> itself if some options are null, but I don’t really feel like this is
> ideal. Did I miss something when I was designing this set of classes? Is
> this behavior expected?
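
For readers of the archive, the workaround described in the quoted message (having the reader fall back to the job config inside initialize when its options were never set) can be sketched roughly as below. This is only an illustration under stated assumptions: SelfConfiguringReader, the config keys, and the default values are hypothetical stand-ins, a plain Map takes the place of the Hadoop job configuration, and none of it is the actual Crunch CSVRecordReader or CSVInputFormat code.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a CSV record reader. Not the Crunch CSVRecordReader.
public class SelfConfiguringReader {
    // Hypothetical config keys and defaults, chosen only for illustration.
    static final String DELIMITER_KEY = "csv.delimiter";
    static final String BUFFER_SIZE_KEY = "csv.buffersize";
    static final char DEFAULT_DELIMITER = ',';
    static final int DEFAULT_BUFFER_SIZE = 64 * 1024;

    // Boxed types so that "never configured" can be represented as null.
    private Character delimiter;
    private Integer bufferSize;

    // Path where nothing upstream configured the reader.
    public SelfConfiguringReader() {
    }

    // Path where the input format (or source) passed the options explicitly.
    public SelfConfiguringReader(char delimiter, int bufferSize) {
        this.delimiter = delimiter;
        this.bufferSize = bufferSize;
    }

    // Fill in any option that is still null from the job config,
    // falling back to a default when the config has no entry either.
    public void initialize(Map<String, String> jobConf) {
        if (delimiter == null) {
            String value = jobConf.get(DELIMITER_KEY);
            delimiter = (value == null || value.isEmpty())
                    ? DEFAULT_DELIMITER
                    : value.charAt(0);
        }
        if (bufferSize == null) {
            String value = jobConf.get(BUFFER_SIZE_KEY);
            bufferSize = (value == null)
                    ? DEFAULT_BUFFER_SIZE
                    : Integer.parseInt(value);
        }
    }

    public char getDelimiter() {
        return delimiter;
    }

    public int getBufferSize() {
        return bufferSize;
    }

    public static void main(String[] args) {
        Map<String, String> jobConf = new HashMap<>();
        jobConf.put(DELIMITER_KEY, ";");

        // Created without options, so initialize reads them from the config.
        SelfConfiguringReader reader = new SelfConfiguringReader();
        reader.initialize(jobConf);
        System.out.println(reader.getDelimiter() + " " + reader.getBufferSize());
    }
}
```

Options passed explicitly are left untouched in this sketch, so the fallback only matters when the reader is constructed outside the usual configuration path.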

Director of Data Science
Cloudera
Twitter: @josh_wills
