crunch-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Josh Wills <jwi...@cloudera.com>
Subject Re: CSVFileSource weirdness
Date Tue, 24 Jun 2014 22:56:42 GMT
Okay, looking a bit closer:

1) CSVInputFormat specifies an implementation of configure(JobConf), but
should extend the Configured abstract base class so that the system knows
to call configure() on the class once it gets instantiated-- right now,
that configure(JobConf) isn't being called, and so the params for the CSV
file we use to configure the CSVRecordReader aren't getting set. It should
also have the signature be configure(Configuration) instead of
configure(JobConf).

2) The other option would be to read the configuration parameters we need
for the CSVRecordReader from the Configuration object that is attached to
the TaskAttemptContext in the call to createRecordReader. That's what we do
in AvroInputFormat to pull the schema out of the configuration:
https://github.com/apache/crunch/blob/master/crunch-core/src/main/java/org/apache/crunch/types/avro/AvroInputFormat.java

J


On Tue, Jun 24, 2014 at 3:28 PM, Josh Wills <jwills@cloudera.com> wrote:

> Well, the test itself looks a little odd to me- why are you calling
> pipeline.run() right after pipeline.read(new CSVFileSource(...))? There's
> nothing for the pipeline to do at that point.
>
> J
>
>
> On Tue, Jun 24, 2014 at 10:53 AM, Champion,Mac <Mac.Champion@cerner.com>
> wrote:
>
>> Now that the CSVFileSource is in crunch 0.8.3, I’ve been trying to
>> integrate it into the project that originally spurred its creation.
>> However, I’m running into some weird issues.
>>
>> Reading and directly materializing and using a new CSVFileSource works
>> fine, that scenario is already in the CSVFileSourceIT.
>>
>> https://github.com/apache/crunch/blob/apache-crunch-0.8.3/crunch-core/src/it/java/org/apache/crunch/io/text/csv/CSVFileSourceIT.java#L41
>>
>> But, as soon as I try to do something extra with that PCollection, say,
>> use count() to turn it into a PTable, grab its key set, then print it out,
>> everything falls apart
>> New Test:
>>
>> https://github.com/champgm/crunch/blob/master/crunch-core/src/it/java/org/apache/crunch/io/text/csv/CSVFileSourceIT.java#L56
>>
>> Result:
>> http://pastebin.com/f7iUQ73N
>>
>> It seems that, when some additional actions are added to the pipeline, a
>> CSVRecordReader is being created in CrunchRecordReader without going
>> through the CSVFileSource or CSVInputFormat flow, where its various parsing
>> options are normally configured.
>>
>> I was able to fix this issue by copying the "configure” method from
>> CSVInputFormat and adding it to the beginning of the “initialize” method of
>> the CSVRecordReader, which forces it to check the job config and configure
>> itself if some options are null, but I don’t really feel like this is
>> ideal. Did I miss something when I was designing this set of classes? Is
>> this behavior expected?
>>
>> CONFIDENTIALITY NOTICE This message and any included attachments are from
>> Cerner Corporation and are intended only for the addressee. The information
>> contained in this message is confidential and may constitute inside or
>> non-public information under international, federal, or state securities
>> laws. Unauthorized forwarding, printing, copying, distribution, or use of
>> such information is strictly prohibited and may be unlawful. If you are not
>> the addressee, please promptly delete this message and notify the sender of
>> the delivery error by e-mail or you may call Cerner's corporate offices in
>> Kansas City, Missouri, U.S.A at (+1) (816)221-1024.
>>
>
>
>
> --
> Director of Data Science
> Cloudera <http://www.cloudera.com>
> Twitter: @josh_wills <http://twitter.com/josh_wills>
>



-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message