crunch-user mailing list archives

From Josh Wills <jwi...@cloudera.com>
Subject Re: Source creation with From.formattedFile
Date Thu, 23 Jan 2014 20:52:13 GMT
And filed: https://issues.apache.org/jira/browse/CRUNCH-331


On Thu, Jan 23, 2014 at 12:41 PM, Josh Wills <jwills@cloudera.com> wrote:

> I think option 2 makes sense-- let's file a JIRA for it.
>
> J
>
>
> On Thu, Jan 23, 2014 at 12:37 PM, Durfey,Stephen <
> Stephen.Durfey@cerner.com> wrote:
>
>>  Recently I needed to read a CSV file with Crunch. Reading the file as
>> text and splitting each line at a delimiter wasn’t an option, because
>> values in the CSV file could contain a newline character embedded inside
>> quotes. So a teammate and I wrote a custom InputFormat that reads the
>> file and generates splits at the end of a valid CSV record, rather than
>> at the first newline character.
>>
>>  We started using From.formattedFile (I wasn’t aware of it until I read
>> the user guide, so thanks, Josh, for putting that together) to create the
>> TableSource we needed to read the file. After some testing we noticed that
>> the getSplits method we had overridden in our InputFormat wasn’t being
>> called. After some debugging we found our way to ‘CrunchInputFormat’
>> and saw that our InputFormat was being replaced with
>> ‘CrunchCombineInputFormat’, which was causing our splits to be
>> incorrect. After we set the config key that disables
>> ‘CrunchCombineInputFormat’, everything worked as it should.
>>
>>  I have two possible requests/suggestions:
>>
>>    1. If the desired behavior is to use CrunchCombineInputFormat by
>>    default (even if the developer specifies their own InputFormat), can
>>    this be mentioned in the Source section of the user guide? The config
>>    key for disabling the combine is mentioned in the user guide, but not
>>    near the Source information, so we were unaware of this behavior until
>>    we debugged through the code.
>>    2. If the developer uses From.formattedFile and specifically uses a
>>    certain InputFormat, can that be honored and have the use of
>>    CrunchCombineInputFormat be disabled without developer intervention?
>>
>>
>>  I would think option 2 is preferred. My expectation was that my
>> InputFormat would be used rather than the code defaulting to a different
>> InputFormat.
>>
>>  Stephen Durfey
>> Software Engineer|The Record
>> 816-201-2689 | Stephen.Durfey@cerner.com
>>
>
>
>
> --
> Director of Data Science
> Cloudera <http://www.cloudera.com>
> Twitter: @josh_wills <http://twitter.com/josh_wills>
>
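
For anyone reading this thread later: the reason splitting on the first newline breaks for this kind of CSV can be shown with a small sketch. This is not the InputFormat from the quoted message; `CsvRecordSplitter` and `splitRecords` are hypothetical names, and the code assumes double-quoted fields with `""` as the escaped quote. It only tracks whether the scanner is inside quotes and ends a record at a newline that falls outside them.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: splits CSV text into records, honoring quoted newlines. */
final class CsvRecordSplitter {
    private CsvRecordSplitter() {}

    static List<String> splitRecords(String text) {
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;

        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '"') {
                // An escaped quote ("") toggles the flag twice, so the
                // inside-quotes state is unchanged after the pair.
                inQuotes = !inQuotes;
                current.append(c);
            } else if (c == '\n' && !inQuotes) {
                // A newline outside quotes ends the record; drop a trailing \r.
                int len = current.length();
                if (len > 0 && current.charAt(len - 1) == '\r') {
                    current.setLength(len - 1);
                }
                records.add(current.toString());
                current.setLength(0);
            } else {
                // Ordinary character, or a newline embedded inside quotes.
                current.append(c);
            }
        }
        // Keep a final record that has no terminating newline.
        if (current.length() > 0) {
            records.add(current.toString());
        }
        return records;
    }
}
```

A plain line-based reader would see four lines in `id,note\n1,"line one\nline two"\n2,plain\n`, while this scan yields three records. A real InputFormat has the harder job of finding a safe record boundary when a split starts in the middle of a file, which this sketch does not attempt.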



-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
