crunch-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Champion,Mac" <>
Subject Re: CSVFileSource weirdness
Date Wed, 25 Jun 2014 14:33:43 GMT
1) One thing I really don¹t understand is how the tests which need extra
configuration (like the one which uses asterisks as quotation marks) can
pass if this configure method isn¹t being run. I know it¹s not; I¹ve even
commented out that entire method and, somehow, those files are still
parsed as expected. How is CSVInputFormat being configured in these cases?

2) That¹s how I fixed things at first. Though, at the time I felt that I
was duplicating the configuration logic, so it didn¹t seem very appealing
to me. 

Anyway, my confusion aside, I went with your first suggestion.
CSVInputFormat now implements Configurable and the overridden
setConf(Configuration) method handles the local configuration of the
various control characters. All tests pass as expected now. I will clean
up what I¹ve done (including in the tests) create a JIRA, and
attach a patch ASAP. Let me know if you think I¹ve overlooked anything or
if there are any test cases that I¹ve missed.

Thank you for taking a look,

On 6/24/14, 5:56 PM, "Josh Wills" <> wrote:

>Okay, looking a bit closer:
>1) CSVInputFormat specifies an implementation of configure(JobConf), but
>should extend the Configured abstract base class so that the system knows
>to call configure() on the class once it gets instantiated-- right now,
>that configure(JobConf) isn't being called, and so the params for the CSV
>file we use to configure the CSVRecordReader aren't getting set. It should
>also have the signature be configure(Configuration) instead of
>2) The other option would be to read the configuration parameters we need
>for the CSVRecordReader from the Configuration object that is attached to
>the TaskAttemptContext in the call to createRecordReader. That's what we
>in AvroInputFormat to pull the schema out of the configuration:
>On Tue, Jun 24, 2014 at 3:28 PM, Josh Wills <> wrote:
>> Well, the test itself looks a little odd to me- why are you calling
>> right after CSVFileSource(...))?
>> nothing for the pipeline to do at that point.
>> J
>> On Tue, Jun 24, 2014 at 10:53 AM, Champion,Mac <>
>> wrote:
>>> Now that the CSVFileSource is in crunch 0.8.3, I¹ve been trying to
>>> integrate it into the project that originally spurred its creation.
>>> However, I¹m running into some weird issues.
>>> Reading and directly materializing and using a new CSVFileSource works
>>> fine, that scenario is already in the CSVFileSourceIT.
>>> But, as soon as I try to do something extra with that PCollection, say,
>>> use count() to turn it into a PTable, grab its key set, then print it
>>> everything falls apart
>>> New Test:
>>> Result:
>>> It seems that, when some additional actions are added to the pipeline,
>>> CSVRecordReader is being created in CrunchRecordReader without going
>>> through the CSVFileSource or CSVInputFormat flow, where its various
>>> options are normally configured.
>>> I was able to fix this issue by copying the "configure² method from
>>> CSVInputFormat and adding it to the beginning of the ³initialize²
>>>method of
>>> the CSVRecordReader, which forces it to check the job config and
>>> itself if some options are null, but I don¹t really feel like this is
>>> ideal. Did I miss something when I was designing this set of classes?
>>> this behavior expected?
>>> CONFIDENTIALITY NOTICE This message and any included attachments are
>>> Cerner Corporation and are intended only for the addressee. The
>>> contained in this message is confidential and may constitute inside or
>>> non-public information under international, federal, or state
>>> laws. Unauthorized forwarding, printing, copying, distribution, or use
>>> such information is strictly prohibited and may be unlawful. If you
>>>are not
>>> the addressee, please promptly delete this message and notify the
>>>sender of
>>> the delivery error by e-mail or you may call Cerner's corporate
>>>offices in
>>> Kansas City, Missouri, U.S.A at (+1) (816)221-1024.
>> --
>> Director of Data Science
>> Cloudera <>
>> Twitter: @josh_wills <>
>Director of Data Science
>Cloudera <>
>Twitter: @josh_wills <>

View raw message