crunch-dev mailing list archives

From "Brandon Inman (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CRUNCH-414) The CSV file source needs to be a little more robust when handling multi-line CSV files
Date Wed, 25 Jun 2014 18:34:25 GMT

    [ https://issues.apache.org/jira/browse/CRUNCH-414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14043885#comment-14043885 ]

Brandon Inman commented on CRUNCH-414:
--------------------------------------

I see what you are getting at.  

Regardless of this particular change, you might mitigate some memory concerns by moving the
StringBuilder and the call to inputText.set() outside the loop, and then moving the threshold
check inside the loop as you suggested.  You still need the endOfFile check even with a threshold
check, in case there isn't enough file left to trigger the threshold.
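
A rough sketch of that loop structure follows. It is not the actual CSVLineReader code: the
class name BoundedCsvRecordReader, the maxRecordChars parameter, and the exception messages are
hypothetical, and quote tracking is simplified to toggling on every '"' character (an escaped
"" pair toggles twice and cancels out). The single buffer.toString() after the loop stands in
for the one call to inputText.set().

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;

/** Hypothetical sketch only; not the real Crunch CSVLineReader. */
class BoundedCsvRecordReader {
    private final BufferedReader in;
    private final int maxRecordChars; // assumed configurable threshold
    // Allocated once and reused for every record, instead of once per loop pass.
    private final StringBuilder buffer = new StringBuilder();

    BoundedCsvRecordReader(Reader reader, int maxRecordChars) {
        this.in = new BufferedReader(reader);
        this.maxRecordChars = maxRecordChars;
    }

    /** Returns the next logical record (may span lines), or null at end of input. */
    String readRecord() throws IOException {
        buffer.setLength(0); // reset the shared buffer without reallocating
        boolean inQuotes = false;
        boolean endOfFile = false;
        boolean readAnything = false;

        while (true) {
            String line = in.readLine();
            if (line == null) {
                endOfFile = true;
                break;
            }
            if (readAnything) {
                buffer.append('\n'); // readLine strips the terminator; rejoin with '\n'
            }
            readAnything = true;
            buffer.append(line);

            // Simplified quote tracking: every '"' flips the state.
            for (int i = 0; i < line.length(); i++) {
                if (line.charAt(i) == '"') {
                    inQuotes = !inQuotes;
                }
            }
            if (!inQuotes) {
                break; // record is complete
            }
            // Threshold check inside the loop: stop accumulating as soon as an
            // open quote has swallowed more than the allowed number of characters.
            if (buffer.length() > maxRecordChars) {
                throw new IOException("Unterminated quote: record exceeds "
                        + maxRecordChars + " characters");
            }
        }

        // Still needed: a short file can end before the threshold is ever reached.
        if (endOfFile && inQuotes) {
            throw new IOException("Unterminated quote at end of file");
        }
        if (endOfFile && !readAnything) {
            return null;
        }
        return buffer.toString(); // converted once per record, after the loop
    }
}
```

With a small threshold, a stray opening quote fails as soon as the buffer passes the limit
instead of consuming the rest of the file; with a large threshold and a short file, the
end-of-file check is what catches it.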

I agree that max int is pretty big, possibly too big to be useful because by that point we've
probably exhausted resources anyway.  Making it configurable would be a good way to go, tuning
the default with the assumption of a few gigabytes of memory available to it. Optimize around
stability and not necessarily performance, since this is an error case.  

For what it's worth, the files that I've seen are malformed with a relatively random distribution,
so the largest unintentionally escaped chunks are generally 10k or so. Without evidence
either way, I suspect that this would generalize to a lot of data sources.  And since
these files won't process correctly anyway, even if they don't trigger an exception, any occurrence
of them will have to be corrected.

> The CSV file source needs to be a little more robust when handling multi-line CSV files
> ---------------------------------------------------------------------------------------
>
>                 Key: CRUNCH-414
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-414
>             Project: Crunch
>          Issue Type: Improvement
>          Components: Core
>    Affects Versions: 0.8.3
>            Reporter: mac champion
>            Assignee: mac champion
>            Priority: Minor
>              Labels: csv, csvparser
>             Fix For: 0.8.4
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Brandon Inman recently reported an undesirable behavior in the CSV file source group
> of files. Currently, the CSVLineReader, if reading a malformed CSV file, can enter a state
> where it is perpetually waiting for an end-quote character. As he put it, "Malformed files
> are malformed files and should probably fail in some regard, but a hang is obviously undesirable."
>
> Essentially, the CSVLineReader needs to be tweaked in such a way that an informative
> exception is thrown after some threshold is reached, instead of basically just hanging.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
