crunch-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Barry,Nathan" <NBA...@CERNER.COM>
Subject Re: Enhancement to CSV input format?
Date Thu, 07 May 2015 19:57:44 GMT
Agree we don’t want to compile dozens/hundreds of alternate flows to deal
with different flavors of bad CSV files.  Though over the years I
regularly see the unescaped quote problem in data coming from many
systems, so we thought we’d take a stab at enhancing the CSV Input to be
more tolerant to that specific scenario.

Developers can always fall back to the normal Text File input for handling
CSV files which will treat each line as a record, but that format can’t
handle CSVs with newlines embedded within a quoted field - which is why
the CSV Input format came to life.

The current implementation of the CSV Input is clearly not the best way to
handle the unescaped quote within a field, so we’re looking to find a
better way to handle this single scenario in cases when we still need to
handle newlines within a field, hence needing to use the CSV Input.

On 5/7/15, 2:02 PM, "Jinal Shah" <jinalshah2007@gmail.com> wrote:

>Hi Nathan,
>
>I may be completely wrong on this but this seems to me more of a specific
>case. If the data passed is not refined properly then the data should be
>handled properly by the developer. But there would be many such cases that
>may pile up based on these changes. So I think it should be developers
>choice on how the data should be handled rather than we making a generic
>change on the code to handle different such use cases. If anyone has any
>thoughts do share on this.
>
>
>- Jinal
>
>On Thu, May 7, 2015 at 9:17 AM, Barry,Nathan <NBARRY@cerner.com> wrote:
>
>> We’ll get a Crunch JIRA logged soon, but at the moment we are debating
>>how
>> the CSV Input should attempt to recover from the unescaped quote, should
>> the input
>>
>> 1) look for a proper end quote (where the next char is a delimiter, EOL
>>or
>> EOF) as recover at that point, mid-line; else fall back to the next EOL
>> Or
>> 2) look for the next EOL and recover at that point
>>
>> Example scenario, invalid input text:
>> ---------------------------------------------
>>
>> this,is,a,"line with 1" problems","in
>> it"
>> this,is,another,line of,text
>>
>> Output using 1 (With ---- as the record breaks)
>> ---------------------------------------------
>>
>> this,is,a,"line with 1" problems","in
>> it”
>> ---------------------------------------------
>> this,is,another,line of,text
>> ---------------------------------------------
>>
>>
>>
>> Output using 2
>> ---------------------------------------------
>>
>> this,is,a,"line with 1" problems",”in
>> ---------------------------------------------
>>
>> it"
>> this,is,another,line of,text
>> ---------------------------------------------
>>
>>
>>
>>
>>
>> On 5/6/15, 9:12 AM, "Josh Wills" <jwills@cloudera.com> wrote:
>>
>> >On Wed, May 6, 2015 at 3:04 PM, Barry,Nathan <NBARRY@cerner.com> wrote:
>> >
>> >> Correct me if I am wrong, but with the current code when an unescaped
>> >> quote is encountered the code doesn¹t always blow up, rather it
>>become
>> >>out
>> >> of sequence with the open/close quotes which will often lead to
>> >>scenarios
>> >> where both the field delimiters and eol markers are now treated as if
>> >>they
>> >> are inside a quoted attribute (i.e. ignored) resulting in a CSV
>>record
>> >> that is potentially huge, with the size being determined when the
>>code
>> >> finds the next unescaped quote, at which point the code will honor
>>the
>> >> next field delimited/eol marker.
>> >>
>> >
>> >That seems like the sort of thing worth fixing in the core, IMHO.
>> >
>> >
>> >>
>> >> So in our bad file examples CSV records were created with individual
>> >> record sizes in the 10s of MB rather than the expected ~3-4KB.
>> >>
>> >> What we were looking to do was to:
>> >> - when currently in a quoted field
>> >> - if we find another quote
>> >> - look to the next character and see:
>> >> - if it¹s a delimiter, eol or eof marker - then close the quote and
>>keep
>> >> processing normally
>> >> - if not a delimited, eol or eof marker - then we have a bad record,
>>so
>> >> ignore all quotes & delimiters and simply look for the next eol/eof
>>and
>> >> break the record there
>> >>
>> >> The thought being that the 1 bad record won¹t corrupt the entire
>>file or
>> >> corrupt the record splits; though the consumers of each record would
>> >>then
>> >> encounter errors when trying to parse the record, but they can then
>> >> determine the best course of action:  ignore it, reject it, reject
>>the
>> >> whole file, etc.
>> >>
>> >> Nathan
>> >>
>> >> On 5/5/15, 9:30 AM, "Champion,Mac" <Mac.Champion@Cerner.com> wrote:
>> >>
>> >> >Some users of the CSV Input Format at Cerner had some issues with
>>CSV
>> >> >files from clients where there were stray, unescaped double-quotes
>> >>inside
>> >> >of fields (ostensibly representing inches). Some bureaucratic stuff
>> >> >prevented us from getting those files reliably cleaned up, so we
>> >> >brainstormed and figured out a way to make the CSV Input Format
>>able to
>> >> >ignore the stray quotes and pass them forward to be handled by
>>whatever
>> >> >parsing solution comes later. We are working on implementing this
>>into
>> >> >our copy of the input format and it seems to be working so far.
>> >> >
>> >> >My question is, is this something that we should log a JIRA for and
>> >> >submit our work to Crunch as well? It¹s handy in our case, but the
>> >>files
>> >> >are truly malformed and not following the CSV standards. Should the
>> >> >CSVInputFormat have configurable options to be able to handle
>>malformed
>> >> >files and pass bad records forward, or is the current behavior
>>(blow up
>> >> >and give some info about where the bad records start) the way it
>>truly
>> >> >should behave?
>> >> >
>> >> >Thanks for your input,
>> >> >Mac
>> >> >
>> >> >CONFIDENTIALITY NOTICE This message and any included attachments are
>> >>from
>> >> >Cerner Corporation and are intended only for the addressee. The
>> >> >information contained in this message is confidential and may
>> >>constitute
>> >> >inside or non-public information under international, federal, or
>>state
>> >> >securities laws. Unauthorized forwarding, printing, copying,
>> >> >distribution, or use of such information is strictly prohibited and
>>may
>> >> >be unlawful. If you are not the addressee, please promptly delete
>>this
>> >> >message and notify the sender of the delivery error by e-mail or you
>> >>may
>> >> >call Cerner's corporate offices in Kansas City, Missouri, U.S.A at
>>(+1)
>> >> >(816)221-1024.
>> >>
>> >>
>> >
>> >
>> >--
>> >Director of Data Science
>> >Cloudera <http://www.cloudera.com>
>> >Twitter: @josh_wills <http://twitter.com/josh_wills>
>>
>>

Mime
View raw message