crunch-user mailing list archives

From "Durfey,Stephen" <Stephen.Dur...@Cerner.com>
Subject Source creation with From.formattedFile
Date Thu, 23 Jan 2014 20:37:41 GMT
Recently I needed to read a CSV file with Crunch. Reading the CSV file as a text file
and then splitting at a delimiter wasn’t an option, because values in the CSV file could
contain a newline character embedded inside quotes. So a teammate and I wrote our own
custom input format that reads the file and generates splits at the end of a valid CSV
line, rather than at the first newline character.
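
To illustrate the boundary rule described above, here is a minimal sketch in plain Java. It is not the input format we wrote. The class and method names (CsvRecordBoundary, findRecordEnd) are hypothetical, and it assumes the scan starts at a known record start, that '"' is the quote character, that a doubled quote ("") is the escape, and that '\n' terminates a record. A real InputFormat would also have to work on byte streams and find a safe starting point inside a split.

```java
public final class CsvRecordBoundary {
    private CsvRecordBoundary() {}

    /**
     * Returns the index just past the newline that ends the record starting at
     * {@code start}, or {@code data.length()} if the data ends before one is found.
     * Assumes {@code start} is a true record start (i.e. not inside quotes).
     */
    public static int findRecordEnd(CharSequence data, int start) {
        boolean inQuotes = false;
        for (int i = start; i < data.length(); i++) {
            char c = data.charAt(i);
            if (c == '"') {
                // An escaped quote ("") toggles the flag twice, so the
                // quoted state is unchanged after both characters.
                inQuotes = !inQuotes;
            } else if (c == '\n' && !inQuotes) {
                // Only a newline outside quotes ends the record.
                return i + 1;
            }
        }
        return data.length();
    }

    public static void main(String[] args) {
        // The second record contains a newline inside a quoted field.
        String csv = "id,note\n1,\"line one\nline two\"\n2,plain\n";
        int pos = 0;
        while (pos < csv.length()) {
            int end = findRecordEnd(csv, pos);
            System.out.println("record: " + csv.substring(pos, end).trim());
            pos = end;
        }
    }
}
```

Splitting the same input on every '\n' would cut the quoted field in half, which is the problem the custom getSplits logic has to avoid.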

We started using From.formattedFile (I wasn’t aware of it until I read the user guide, so
thanks Josh for putting that together) to create the TableSource we needed to read the file.
During testing we noticed that the getSplits method we had overridden in our InputFormat was
never called. After some debugging we found our way to ‘CrunchInputFormat’ and saw that our
InputFormat was being replaced with ‘CrunchCombineInputFormat’, which made our splits
incorrect. Once we used the config key to disable ‘CrunchCombineInputFormat’, everything
worked as it should.

I have two possible requests/suggestions:

  1.  If the desired behavior is to use CrunchCombineInputFormat by default (even if the developer
specifies their own InputFormat), can this be mentioned in the Source section of the user guide?
The config key for disabling the combine is mentioned in the user guide, but not near the Source
information, so we were unaware of this behavior until we debugged through the code.
  2.  If the developer uses From.formattedFile and explicitly specifies an InputFormat,
can that be honored, with CrunchCombineInputFormat disabled automatically and no developer
intervention required?

I would think option 2 is preferable. My expectation was that my InputFormat would be used,
rather than the code defaulting to a different InputFormat.

Stephen Durfey
Software Engineer|The Record
816-201-2689 | Stephen.Durfey@cerner.com
