crunch-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jinal Shah <jinalshah2...@gmail.com>
Subject Re: Enhancement to CSV input format?
Date Thu, 07 May 2015 19:02:16 GMT
Hi Nathan,

I may be completely wrong on this but this seems to me more of a specific
case. If the data passed is not refined properly then the data should be
handled properly by the developer. But there would be many such cases that
may pile up based on these changes. So I think it should be developers
choice on how the data should be handled rather than we making a generic
change on the code to handle different such use cases. If anyone has any
thoughts do share on this.


- Jinal

On Thu, May 7, 2015 at 9:17 AM, Barry,Nathan <NBARRY@cerner.com> wrote:

> We’ll get a Crunch JIRA logged soon, but at the moment we are debating how
> the CSV Input should attempt to recover from the unescaped quote, should
> the input
>
> 1) look for a proper end quote (where the next char is a delimiter, EOL or
> EOF) as recover at that point, mid-line; else fall back to the next EOL
> Or
> 2) look for the next EOL and recover at that point
>
> Example scenario, invalid input text:
> ---------------------------------------------
>
> this,is,a,"line with 1" problems","in
> it"
> this,is,another,line of,text
>
> Output using 1 (With ---- as the record breaks)
> ---------------------------------------------
>
> this,is,a,"line with 1" problems","in
> it”
> ---------------------------------------------
> this,is,another,line of,text
> ---------------------------------------------
>
>
>
> Output using 2
> ---------------------------------------------
>
> this,is,a,"line with 1" problems",”in
> ---------------------------------------------
>
> it"
> this,is,another,line of,text
> ---------------------------------------------
>
>
>
>
>
> On 5/6/15, 9:12 AM, "Josh Wills" <jwills@cloudera.com> wrote:
>
> >On Wed, May 6, 2015 at 3:04 PM, Barry,Nathan <NBARRY@cerner.com> wrote:
> >
> >> Correct me if I am wrong, but with the current code when an unescaped
> >> quote is encountered the code doesn¹t always blow up, rather it become
> >>out
> >> of sequence with the open/close quotes which will often lead to
> >>scenarios
> >> where both the field delimiters and eol markers are now treated as if
> >>they
> >> are inside a quoted attribute (i.e. ignored) resulting in a CSV record
> >> that is potentially huge, with the size being determined when the code
> >> finds the next unescaped quote, at which point the code will honor the
> >> next field delimited/eol marker.
> >>
> >
> >That seems like the sort of thing worth fixing in the core, IMHO.
> >
> >
> >>
> >> So in our bad file examples CSV records were created with individual
> >> record sizes in the 10s of MB rather than the expected ~3-4KB.
> >>
> >> What we were looking to do was to:
> >> - when currently in a quoted field
> >> - if we find another quote
> >> - look to the next character and see:
> >> - if it¹s a delimiter, eol or eof marker - then close the quote and keep
> >> processing normally
> >> - if not a delimited, eol or eof marker - then we have a bad record, so
> >> ignore all quotes & delimiters and simply look for the next eol/eof and
> >> break the record there
> >>
> >> The thought being that the 1 bad record won¹t corrupt the entire file or
> >> corrupt the record splits; though the consumers of each record would
> >>then
> >> encounter errors when trying to parse the record, but they can then
> >> determine the best course of action:  ignore it, reject it, reject the
> >> whole file, etc.
> >>
> >> Nathan
> >>
> >> On 5/5/15, 9:30 AM, "Champion,Mac" <Mac.Champion@Cerner.com> wrote:
> >>
> >> >Some users of the CSV Input Format at Cerner had some issues with CSV
> >> >files from clients where there were stray, unescaped double-quotes
> >>inside
> >> >of fields (ostensibly representing inches). Some bureaucratic stuff
> >> >prevented us from getting those files reliably cleaned up, so we
> >> >brainstormed and figured out a way to make the CSV Input Format able to
> >> >ignore the stray quotes and pass them forward to be handled by whatever
> >> >parsing solution comes later. We are working on implementing this into
> >> >our copy of the input format and it seems to be working so far.
> >> >
> >> >My question is, is this something that we should log a JIRA for and
> >> >submit our work to Crunch as well? It¹s handy in our case, but the
> >>files
> >> >are truly malformed and not following the CSV standards. Should the
> >> >CSVInputFormat have configurable options to be able to handle malformed
> >> >files and pass bad records forward, or is the current behavior (blow up
> >> >and give some info about where the bad records start) the way it truly
> >> >should behave?
> >> >
> >> >Thanks for your input,
> >> >Mac
> >> >
> >> >CONFIDENTIALITY NOTICE This message and any included attachments are
> >>from
> >> >Cerner Corporation and are intended only for the addressee. The
> >> >information contained in this message is confidential and may
> >>constitute
> >> >inside or non-public information under international, federal, or state
> >> >securities laws. Unauthorized forwarding, printing, copying,
> >> >distribution, or use of such information is strictly prohibited and may
> >> >be unlawful. If you are not the addressee, please promptly delete this
> >> >message and notify the sender of the delivery error by e-mail or you
> >>may
> >> >call Cerner's corporate offices in Kansas City, Missouri, U.S.A at (+1)
> >> >(816)221-1024.
> >>
> >>
> >
> >
> >--
> >Director of Data Science
> >Cloudera <http://www.cloudera.com>
> >Twitter: @josh_wills <http://twitter.com/josh_wills>
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message