crunch-user mailing list archives

From Josh Wills <>
Subject Re: Source creation with From.formattedFile
Date Thu, 23 Jan 2014 20:52:13 GMT
And filed:

On Thu, Jan 23, 2014 at 12:41 PM, Josh Wills <> wrote:

> I think option 2 makes sense-- let's file a JIRA for it.
> J
> On Thu, Jan 23, 2014 at 12:37 PM, Durfey, Stephen <> wrote:
>>  Recently I needed to read a CSV file with Crunch. Reading the file as
>> text and then splitting on a delimiter wasn’t an option, because values in
>> the CSV file could contain newline characters embedded inside quotes. So a
>> teammate and I wrote a custom InputFormat that reads the file and
>> generates splits at the end of a valid CSV record, rather than at the
>> first newline character.
>>  We started using From.formattedFile (I wasn’t aware of it until I read
>> the user guide, so thanks Josh for putting that together) to create the
>> TableSource we needed to read the file. During testing we noticed that
>> the getSplits method we had overridden in our InputFormat wasn’t being
>> called. After some debugging we traced the problem to ‘CrunchInputFormat’,
>> where our InputFormat was being replaced with ‘CrunchCombineInputFormat’,
>> which made our splits incorrect. Once we set the config key that disables
>> the combine behavior so ‘CrunchCombineInputFormat’ wasn’t used, everything
>> worked as it should.
>>  I have two possible requests/suggestions:
>>    1. If the desired behavior is to use CrunchCombineInputFormat by
>>    default (even if the developer specifies their own InputFormat), can
>>    this be mentioned in the Source section of the user guide? The config
>>    key for disabling the combine is mentioned in the user guide, but not
>>    near the Source information, so we were unaware of this behavior until
>>    we debugged through the code.
>>    2. If the developer calls From.formattedFile with a specific
>>    InputFormat, can that choice be honored and the use of
>>    CrunchCombineInputFormat be disabled without developer intervention?
>>  I would think option 2 is preferred. My expectation was that my
>> InputFormat would be used rather than the code defaulting to a different
>> InputFormat.
>>  Stephen Durfey
>> Software Engineer|The Record
>> 816-201-2689 |
> --
> Director of Data Science
> Cloudera <>
> Twitter: @josh_wills <>

Director of Data Science
Cloudera <>
Twitter: @josh_wills <>
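
For readers following the thread: the quote-aware record splitting that Stephen describes (ending a record only at a newline that falls outside double quotes) can be sketched roughly as below. This is not the code from the thread and not part of Crunch or Hadoop. The class name QuoteAwareCsvSplitter and the method splitRecords are hypothetical, the sketch assumes "\n" line endings and double-quote quoting, and it works on an in-memory string rather than on file splits as a real InputFormat would.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Illustrative sketch only (hypothetical class, not from Crunch or the thread):
 * finds CSV record boundaries, treating newlines inside double quotes as data.
 */
public class QuoteAwareCsvSplitter {

    /** Returns each CSV record in text, without its terminating newline. */
    public static List<String> splitRecords(String text) {
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '"') {
                // An escaped quote ("") toggles the flag twice, so the
                // quoted/unquoted state is unchanged once both are consumed.
                inQuotes = !inQuotes;
                current.append(c);
            } else if (c == '\n' && !inQuotes) {
                // A newline outside quotes ends the current record.
                records.add(current.toString());
                current.setLength(0);
            } else {
                // Everything else, including newlines inside quotes, is data.
                current.append(c);
            }
        }
        if (current.length() > 0) {
            // Trailing record with no final newline.
            records.add(current.toString());
        }
        return records;
    }

    public static void main(String[] args) {
        String csv = "id,comment\n1,\"line one\nline two\"\n2,plain\n";
        for (String record : splitRecords(csv)) {
            System.out.println("[" + record + "]");
        }
    }
}
```

A custom InputFormat would apply the same idea when computing splits, which is why it matters that its own getSplits is the one that runs rather than one supplied by a wrapping combine format.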
