crunch-dev mailing list archives

From Josh Wills <jwi...@cloudera.com>
Subject Re: Enhancement to CSV input format?
Date Wed, 06 May 2015 14:12:52 GMT
On Wed, May 6, 2015 at 3:04 PM, Barry,Nathan <NBARRY@cerner.com> wrote:

> Correct me if I am wrong, but with the current code, when an unescaped
> quote is encountered the code doesn't always blow up. Rather, it becomes
> out of sequence with the open/close quotes, which will often lead to
> scenarios where both the field delimiters and eol markers are treated as
> if they are inside a quoted attribute (i.e. ignored). The result is a CSV
> record that is potentially huge, with its size determined by where the
> code finds the next unescaped quote, at which point the code will honor
> the next field delimiter/eol marker.
>
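
To make the failure mode above concrete, here is a minimal sketch of a toggle-based record splitter. It is an illustration only, not the Crunch reader: the function name `split_records_naive` and the sample data are hypothetical, and escape characters are left out for brevity. Every quote flips the in-quotes state, so one stray quote leaves the state inverted and later eol markers are swallowed.

```python
def split_records_naive(text, quote='"', eol="\n"):
    """Split text into records, treating any eol inside quotes as data."""
    records, current, in_quotes = [], [], False
    for ch in text:
        if ch == quote:
            in_quotes = not in_quotes  # every quote flips the state
        if ch == eol and not in_quotes:
            records.append("".join(current))  # record boundary
            current = []
        else:
            current.append(ch)
    if current:
        records.append("".join(current))  # trailing data without a final eol
    return records


# Hypothetical input: the second line has a stray quote (5" meaning 5 inches).
sample = 'a,"x"\nb,"5" pipe"\nc,"y"\nd,"z"\n'
records = split_records_naive(sample)
# Everything from the second line onward ends up in a single record.
print(len(records), [len(r) for r in records])
```

With well-formed input the splitter behaves; with the stray quote, the second record keeps growing until another unbalanced quote or the end of the input restores the state.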

That seems like the sort of thing worth fixing in the core, IMHO.


>
> So in our bad file examples CSV records were created with individual
> record sizes in the 10s of MB rather than the expected ~3-4KB.
>
> What we were looking to do was to:
> - when currently in a quoted field
> - if we find another quote
> - look to the next character and see:
>   - if it's a delimiter, eol or eof marker - then close the quote and keep
>     processing normally
>   - if not a delimiter, eol or eof marker - then we have a bad record, so
>     ignore all quotes & delimiters and simply look for the next eol/eof and
>     break the record there
>
> The thought being that the 1 bad record won't corrupt the entire file or
> corrupt the record splits; though the consumers of each record would then
> encounter errors when trying to parse the record, but they can then
> determine the best course of action:  ignore it, reject it, reject the
> whole file, etc.
>
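
The lookahead rule proposed above can be sketched roughly as follows. This is not the patch under discussion or any Crunch API: `split_records_tolerant` is a hypothetical name, and the default delimiter, quote, and backslash escape character are assumptions made for the example.

```python
def split_records_tolerant(text, delimiter=",", quote='"', escape="\\", eol="\n"):
    """Split text into records, ending a malformed record at the next eol."""
    records, current = [], []
    in_quotes = False  # currently inside a quoted field
    bad = False        # current record is already known to be malformed
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        nxt = text[i + 1] if i + 1 < n else None  # None stands for eof
        if ch == eol and (bad or not in_quotes):
            # Record boundary: reset all state for the next record.
            records.append("".join(current))
            current, in_quotes, bad = [], False, False
            i += 1
            continue
        current.append(ch)
        if bad:
            pass  # ignore quotes and delimiters until the next eol/eof
        elif ch == escape and nxt is not None:
            current.append(nxt)  # escaped character is plain data
            i += 1
        elif ch == quote:
            if not in_quotes:
                in_quotes = True  # opening quote
            elif nxt is None or nxt in (delimiter, eol):
                in_quotes = False  # legitimate closing quote
            else:
                bad = True  # stray quote: stop honoring quotes in this record
        i += 1
    if current:
        records.append("".join(current))  # trailing data without a final eol
    return records


# Hypothetical input: the second line has a stray quote, the rest are valid.
sample = 'a,"x"\nb,"5" pipe"\nc,"y"\nd,"z"\n'
for record in split_records_tolerant(sample):
    print(repr(record))
```

In this sketch the malformed line is passed downstream intact as its own record, and the neighbouring records keep their boundaries. One limitation of the heuristic is visible in the code: a stray quote that happens to be followed directly by a delimiter or eol is indistinguishable from a real closing quote.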
> Nathan
>
> On 5/5/15, 9:30 AM, "Champion,Mac" <Mac.Champion@Cerner.com> wrote:
>
> >Some users of the CSV Input Format at Cerner had some issues with CSV
> >files from clients where there were stray, unescaped double-quotes inside
> >of fields (ostensibly representing inches). Some bureaucratic stuff
> >prevented us from getting those files reliably cleaned up, so we
> >brainstormed and figured out a way to make the CSV Input Format able to
> >ignore the stray quotes and pass them forward to be handled by whatever
> >parsing solution comes later. We are working on implementing this into
> >our copy of the input format and it seems to be working so far.
> >
> >My question is, is this something that we should log a JIRA for and
> >submit our work to Crunch as well? It's handy in our case, but the files
> >are truly malformed and not following the CSV standards. Should the
> >CSVInputFormat have configurable options to be able to handle malformed
> >files and pass bad records forward, or is the current behavior (blow up
> >and give some info about where the bad records start) the way it truly
> >should behave?
> >
> >Thanks for your input,
> >Mac
> >
>
>


-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
