crunch-dev mailing list archives

From "Brandon Inman (JIRA)" <>
Subject [jira] [Commented] (CRUNCH-414) The CSV file source needs to be a little more robust when handling multi-line CSV files
Date Wed, 25 Jun 2014 22:08:26 GMT


Brandon Inman commented on CRUNCH-414:

{quote} Not sure I understand about moving the .set/stringBuilder stuff. {quote}
Maybe I can communicate it better in code (again, treat this as pseudocode; it isn't tested for
compilation or completeness)...

    StringBuilder builder = new StringBuilder();
    long totalBytesConsumed = 0;
    do {
      totalBytesConsumed += readFileLine(inputText);
      // A line has been read. We need to see if we're still in quotes and tack
      // on a newline if so.
      // TODO: endOfFile check may not be necessary here
      if (currentlyInQuotes && !endOfFile) {
        // Add one LF to mark the line return, otherwise any multi-line CSV
        // record will all be on one line.
        builder.append('\n');
      }
      if (totalBytesConsumed > QUOTED_SECTION_THRESHOLD_VALUE) {
        throw new IOException("Too many bytes consumed before newline: " + totalBytesConsumed);
      }
    } while (currentlyInQuotes && !endOfFile);

    return (int) totalBytesConsumed;

This may have the potential to lose an ending LF, but I think that was already the case.
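To make the shape of the loop concrete, here is a minimal, self-contained sketch of the same idea. The class name, the simplistic quote-counting, and the fixed 64 MB threshold are all stand-ins of my own, not CSVLineReader's actual implementation:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

public class QuotedLineReader {
    // Hypothetical threshold; the real value should be configurable.
    private static final long QUOTED_SECTION_THRESHOLD_VALUE = 64L * 1024 * 1024;

    private final BufferedReader in;
    private boolean currentlyInQuotes = false;
    private boolean endOfFile = false;

    public QuotedLineReader(BufferedReader in) {
        this.in = in;
    }

    /** Reads one logical CSV record, which may span several physical lines. */
    public String readRecord() throws IOException {
        StringBuilder builder = new StringBuilder();
        long totalBytesConsumed = 0;
        do {
            totalBytesConsumed += readFileLine(builder);
            if (currentlyInQuotes && !endOfFile) {
                // Preserve the embedded line break inside the quoted section.
                builder.append('\n');
            }
            if (totalBytesConsumed > QUOTED_SECTION_THRESHOLD_VALUE) {
                throw new IOException(
                    "Too many bytes consumed before newline: " + totalBytesConsumed);
            }
        } while (currentlyInQuotes && !endOfFile);
        return builder.toString();
    }

    /** Reads one physical line, toggling quote state; returns bytes consumed. */
    private long readFileLine(StringBuilder builder) throws IOException {
        String line = in.readLine();
        if (line == null) {
            endOfFile = true;
            return 0;
        }
        // Naive quote tracking for illustration; real CSV parsing must also
        // handle escaped quotes ("") inside quoted fields.
        for (int i = 0; i < line.length(); i++) {
            if (line.charAt(i) == '"') {
                currentlyInQuotes = !currentlyInQuotes;
            }
        }
        builder.append(line);
        return line.length() + 1; // +1 for the consumed newline
    }

    public static void main(String[] args) throws IOException {
        String csv = "a,\"multi\nline\",c\n";
        QuotedLineReader reader =
            new QuotedLineReader(new BufferedReader(new StringReader(csv)));
        System.out.println(reader.readRecord());
    }
}
```

With a well-formed file the loop exits as soon as the quote state closes; with a malformed file the byte counter grows until the IOException fires instead of hanging.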

{quote} Can you think of a situation where one CSV record would be larger than the size of
the pieces the CSV file should be split into? {quote}

I'm actually curious how Hadoop/Crunch generally deals with records larger than a split size.
While it's not going to be a common use case, I can see the possibility for a CSV to have
extremely large escaped sections that could exceed 64 MB (Base64 image data? raw sensor data?
XML documents in a database?). Ultimately, if it's configurable, the default is less important,
and keeping it around the split size should be pretty sensible, since anyone processing huge
records would likely be adjusting that setting anyway.
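The "configurable, defaulting to the split size" idea could look something like this. The property name `crunch.csv.max.record.size` is purely hypothetical, and a real implementation would read it from the job's Hadoop Configuration rather than this plain-Java `Properties` stand-in:

```java
import java.util.Properties;

public class ThresholdConfig {
    // Hypothetical property name; in Crunch this lookup would go through the
    // Hadoop Configuration, not java.util.Properties.
    static final String MAX_RECORD_SIZE_KEY = "crunch.csv.max.record.size";

    /** Returns the user-supplied threshold, falling back to the split size. */
    static long maxRecordSize(Properties conf, long splitSize) {
        String value = conf.getProperty(MAX_RECORD_SIZE_KEY);
        return value != null ? Long.parseLong(value) : splitSize;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        long splitSize = 64L * 1024 * 1024;
        // No override set: threshold defaults to the split size.
        System.out.println(maxRecordSize(conf, splitSize));
        // User override for jobs with unusually large records.
        conf.setProperty(MAX_RECORD_SIZE_KEY, "134217728");
        System.out.println(maxRecordSize(conf, splitSize));
    }
}
```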

{quote}As for evenly-malformed files, you're right, they won't trigger an exception here,
but will have to be dealt with either manually or by more detailed parsing after these lines
are read.{quote}

My thought is that this will probably require the same kind of anomaly detection that would
normally be applied to detect other types of bad data.

> The CSV file source needs to be a little more robust when handling multi-line CSV files
> ---------------------------------------------------------------------------------------
>                 Key: CRUNCH-414
>                 URL:
>             Project: Crunch
>          Issue Type: Improvement
>          Components: Core
>    Affects Versions: 0.8.3
>            Reporter: mac champion
>            Assignee: mac champion
>            Priority: Minor
>              Labels: csv, csvparser
>             Fix For: 0.8.4
>   Original Estimate: 336h
>  Remaining Estimate: 336h
> Brandon Inman recently reported an undesirable behavior in the CSV file source group
of files. Currently, the CSVLineReader, if reading a malformed CSV file, can enter a state
where it is perpetually waiting for an end-quote character. As he put it, "Malformed files
are malformed files and should probably fail in some regard, but a hang is obviously undesirable."

> Essentially, the CSVLineReader needs to be tweaked in such a way that an informative
exception is thrown after some threshold is reached, instead of basically just hanging. 
