crunch-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Champion,Mac" <Mac.Champ...@Cerner.com>
Subject Re: CSVFileSource weirdness
Date Wed, 25 Jun 2014 14:33:43 GMT
1) One thing I really don¹t understand is how the tests which need extra
configuration (like the one which uses asterisks as quotation marks) can
pass if this configure method isn¹t being run. I know it¹s not; I¹ve even
commented out that entire method and, somehow, those files are still
parsed as expected. How is CSVInputFormat being configured in these cases?

2) That¹s how I fixed things at first. Though, at the time I felt that I
was duplicating the configuration logic, so it didn¹t seem very appealing
to me. 

Anyway, my confusion aside, I went with your first suggestion.
CSVInputFormat now implements Configurable and the overridden
setConf(Configuration) method handles the local configuration of the
various control characters. All tests pass as expected now. I will clean
up what I¹ve done (including pipeline.run in the tests) create a JIRA, and
attach a patch ASAP. Let me know if you think I¹ve overlooked anything or
if there are any test cases that I¹ve missed.

Thank you for taking a look,
-Mac 



On 6/24/14, 5:56 PM, "Josh Wills" <jwills@cloudera.com> wrote:

>Okay, looking a bit closer:
>
>1) CSVInputFormat specifies an implementation of configure(JobConf), but
>should extend the Configured abstract base class so that the system knows
>to call configure() on the class once it gets instantiated-- right now,
>that configure(JobConf) isn't being called, and so the params for the CSV
>file we use to configure the CSVRecordReader aren't getting set. It should
>also have the signature be configure(Configuration) instead of
>configure(JobConf).
>
>2) The other option would be to read the configuration parameters we need
>for the CSVRecordReader from the Configuration object that is attached to
>the TaskAttemptContext in the call to createRecordReader. That's what we
>do
>in AvroInputFormat to pull the schema out of the configuration:
>https://github.com/apache/crunch/blob/master/crunch-core/src/main/java/org
>/apache/crunch/types/avro/AvroInputFormat.java
>
>J
>
>
>On Tue, Jun 24, 2014 at 3:28 PM, Josh Wills <jwills@cloudera.com> wrote:
>
>> Well, the test itself looks a little odd to me- why are you calling
>> pipeline.run() right after pipeline.read(new CSVFileSource(...))?
>>There's
>> nothing for the pipeline to do at that point.
>>
>> J
>>
>>
>> On Tue, Jun 24, 2014 at 10:53 AM, Champion,Mac <Mac.Champion@cerner.com>
>> wrote:
>>
>>> Now that the CSVFileSource is in crunch 0.8.3, I¹ve been trying to
>>> integrate it into the project that originally spurred its creation.
>>> However, I¹m running into some weird issues.
>>>
>>> Reading and directly materializing and using a new CSVFileSource works
>>> fine, that scenario is already in the CSVFileSourceIT.
>>>
>>> 
>>>https://github.com/apache/crunch/blob/apache-crunch-0.8.3/crunch-core/sr
>>>c/it/java/org/apache/crunch/io/text/csv/CSVFileSourceIT.java#L41
>>>
>>> But, as soon as I try to do something extra with that PCollection, say,
>>> use count() to turn it into a PTable, grab its key set, then print it
>>>out,
>>> everything falls apart
>>> New Test:
>>>
>>> 
>>>https://github.com/champgm/crunch/blob/master/crunch-core/src/it/java/or
>>>g/apache/crunch/io/text/csv/CSVFileSourceIT.java#L56
>>>
>>> Result:
>>> http://pastebin.com/f7iUQ73N
>>>
>>> It seems that, when some additional actions are added to the pipeline,
>>>a
>>> CSVRecordReader is being created in CrunchRecordReader without going
>>> through the CSVFileSource or CSVInputFormat flow, where its various
>>>parsing
>>> options are normally configured.
>>>
>>> I was able to fix this issue by copying the "configure² method from
>>> CSVInputFormat and adding it to the beginning of the ³initialize²
>>>method of
>>> the CSVRecordReader, which forces it to check the job config and
>>>configure
>>> itself if some options are null, but I don¹t really feel like this is
>>> ideal. Did I miss something when I was designing this set of classes?
>>>Is
>>> this behavior expected?
>>>
>>> CONFIDENTIALITY NOTICE This message and any included attachments are
>>>from
>>> Cerner Corporation and are intended only for the addressee. The
>>>information
>>> contained in this message is confidential and may constitute inside or
>>> non-public information under international, federal, or state
>>>securities
>>> laws. Unauthorized forwarding, printing, copying, distribution, or use
>>>of
>>> such information is strictly prohibited and may be unlawful. If you
>>>are not
>>> the addressee, please promptly delete this message and notify the
>>>sender of
>>> the delivery error by e-mail or you may call Cerner's corporate
>>>offices in
>>> Kansas City, Missouri, U.S.A at (+1) (816)221-1024.
>>>
>>
>>
>>
>> --
>> Director of Data Science
>> Cloudera <http://www.cloudera.com>
>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>
>
>
>
>-- 
>Director of Data Science
>Cloudera <http://www.cloudera.com>
>Twitter: @josh_wills <http://twitter.com/josh_wills>


Mime
View raw message