crunch-user mailing list archives

From Josh Wills
Subject Re: Source creation with From.formattedFile
Date Thu, 23 Jan 2014 20:41:03 GMT
I think option 2 makes sense-- let's file a JIRA for it.


On Thu, Jan 23, 2014 at 12:37 PM, Durfey,Stephen wrote:

>  Recently I needed to read a CSV file with Crunch. Reading the CSV file
> as a text file and then splitting at a delimiter wasn’t an option, because
> values in the file could contain a newline character embedded inside
> quotes. So a teammate and I wrote a custom input format that reads the
> file and generates splits at the end of a valid CSV line, rather than at
> the first newline character.
>  We started using From.formattedFile (I wasn’t aware of it until I read the
> user guide, so thanks Josh for putting that together) to create the
> TableSource we needed to read the file. During testing we noticed that
> the getSplits method we had overridden in our InputFormat wasn’t being
> called. After some debugging we found our way to ‘CrunchInputFormat’
> and saw that our InputFormat was being replaced with
> ‘CrunchCombineInputFormat’, which made our splits incorrect. After we
> disabled the combine through its config key so that
> ‘CrunchCombineInputFormat’ wasn’t used, everything worked as it should.
>  I have two possible requests/suggestions:
>    1. If the desired behavior is to use CrunchCombineInputFormat by
>    default (even if the developer specifies their own InputFormat), can
>    this be mentioned in the Source section of the user guide? The config
>    key for disabling the combine is mentioned in the user guide, but not
>    near the Source information, so we were unaware of this behavior until
>    we debugged through the code.
>    2. If the developer uses From.formattedFile and explicitly specifies
>    an InputFormat, can that be honored, with CrunchCombineInputFormat
>    disabled without developer intervention?
>  I would think option 2 is preferred. My expectation was that my
> InputFormat would be used rather than the code defaulting to a different
> InputFormat.
>  Stephen Durfey
> Software Engineer|The Record
> 816-201-2689 |

Director of Data Science
Cloudera
Twitter: @josh_wills
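
For readers following the thread, the record-boundary rule described in the quoted message (a newline ends a CSV record only when it falls outside double quotes) can be sketched in plain Java. This is a minimal illustration under stated assumptions, not the InputFormat from the thread: the class name CsvRecordSplitter and the method splitRecords are hypothetical, the input is assumed to fit in memory as a String, and a real Hadoop InputFormat would also have to locate record boundaries relative to byte offsets when computing splits.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Crunch or Hadoop.
public class CsvRecordSplitter {

    // Splits CSV text into records. A newline ends a record only when it is
    // outside double quotes. An escaped quote ("") toggles the quote state
    // twice, so it leaves the state unchanged.
    public static List<String> splitRecords(String text) {
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;

        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '"') {
                inQuotes = !inQuotes;
                current.append(c);
            } else if (c == '\n' && !inQuotes) {
                // Drop a trailing carriage return so CRLF files work too.
                int len = current.length();
                if (len > 0 && current.charAt(len - 1) == '\r') {
                    current.setLength(len - 1);
                }
                records.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        // Keep a final record that has no terminating newline.
        if (current.length() > 0) {
            records.add(current.toString());
        }
        return records;
    }

    public static void main(String[] args) {
        String csv = "id,note\n1,\"line one\nline two\"\n2,plain\n";
        for (String record : splitRecords(csv)) {
            // Show embedded newlines as \n so each record prints on one line.
            System.out.println("[" + record.replace("\n", "\\n") + "]");
        }
    }
}
```

Splitting the same input on every newline would cut the second record in half, which is the failure mode that reading the file as plain text would produce.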
