crunch-dev mailing list archives

From "mac champion (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CRUNCH-414) The CSV file source needs to be a little more robust when handling multi-line CSV files
Date Wed, 25 Jun 2014 21:43:24 GMT

    [ https://issues.apache.org/jira/browse/CRUNCH-414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14044088#comment-14044088 ]

mac champion commented on CRUNCH-414:
-------------------------------------

Not sure I understand about moving the .set/stringBuilder stuff. Each execution of the loop
is another in-quotes line which needs to have a newline appended to it. I think it still needs
to be appended inside of the loop, right?

I added the EOF check to the do-while, you're definitely right about that. 
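
To make the loop shape concrete, here is a minimal sketch of a do-while that appends each in-quotes line (with its newline) inside the loop and also stops at EOF. It is not the actual CSVLineReader code: the class name, the method names, and the use of BufferedReader are assumptions for illustration, and quote handling is reduced to counting plain '"' characters.

```java
import java.io.BufferedReader;
import java.io.IOException;

// Hypothetical sketch, not the Crunch CSVLineReader implementation.
final class MultiLineRecordSketch {

    // True if the line holds an odd number of quote characters, which flips
    // the "inside a quoted field" state.
    static boolean hasOddQuoteCount(String line) {
        int quotes = 0;
        for (int i = 0; i < line.length(); i++) {
            if (line.charAt(i) == '"') {
                quotes++;
            }
        }
        return quotes % 2 == 1;
    }

    // Reads one logical CSV record, which may span several physical lines.
    // Returns null when the reader is already at EOF.
    static String readRecord(BufferedReader reader) throws IOException {
        StringBuilder record = new StringBuilder();
        boolean inQuotes = false;
        String line;
        do {
            line = reader.readLine();
            if (line == null) {
                break; // EOF check: never wait forever for an end quote
            }
            if (record.length() > 0) {
                record.append('\n'); // restore the newline between in-quotes lines
            }
            record.append(line);
            if (hasOddQuoteCount(line)) {
                inQuotes = !inQuotes;
            }
        } while (inQuotes);
        return (line == null && record.length() == 0) ? null : record.toString();
    }
}
```

In this sketch an unterminated quote at EOF simply returns whatever was accumulated; whether that case should instead raise an exception is a separate decision.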

What do you think about the threshold being the split size? 64 MB is the default, but it can be
configured (check CSVInputFormat's getSplits method to see how it's used). That seems like
a decently logical place to abort if the end of the record hasn't been found. Can you think
of a situation where one CSV record would be larger than the size of the pieces the CSV file
should be split into?
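
A rough sketch of the threshold idea, under some loud assumptions: the class name and the maxRecordChars parameter are hypothetical, the limit is counted in characters rather than bytes for simplicity, and in the real reader the value would presumably come from the configured split size rather than being passed in like this.

```java
import java.io.BufferedReader;
import java.io.IOException;

// Hypothetical sketch, not the Crunch CSVLineReader implementation.
final class BoundedRecordSketch {

    // Reads one logical CSV record, aborting with an informative exception
    // if a quoted field is still open after maxRecordChars characters.
    static String readRecord(BufferedReader reader, long maxRecordChars) throws IOException {
        StringBuilder record = new StringBuilder();
        boolean inQuotes = false;
        String line;
        do {
            line = reader.readLine();
            if (line == null) {
                break; // EOF
            }
            if (record.length() > 0) {
                record.append('\n');
            }
            record.append(line);
            // Each quote character toggles the in-quotes state.
            for (int i = 0; i < line.length(); i++) {
                if (line.charAt(i) == '"') {
                    inQuotes = !inQuotes;
                }
            }
            if (inQuotes && record.length() > maxRecordChars) {
                throw new IOException("CSV record exceeded " + maxRecordChars
                    + " characters without a closing quote; the file is probably malformed");
            }
        } while (inQuotes);
        return (line == null && record.length() == 0) ? null : record.toString();
    }
}
```

With the limit tied to the split size, a well-formed record would only hit the exception if it were larger than a whole split.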

As for evenly-malformed files, you're right, they won't trigger an exception here, but will
have to be dealt with either manually or by more detailed parsing after these lines are read.
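
To illustrate what "evenly malformed" means here, a small sketch assuming the reader only tracks quote parity (the class and method names are hypothetical): one line is missing its closing quote and a later line has a stray quote, so the quotes balance and the two lines are silently merged into one record instead of raising an error.

```java
// Hypothetical sketch of a quote-parity check, not Crunch code.
final class QuoteParitySketch {

    // True if the text ends while a quoted field is still open,
    // judged purely by counting quote characters.
    static boolean hasUnclosedQuote(String text) {
        boolean inQuotes = false;
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == '"') {
                inQuotes = !inQuotes;
            }
        }
        return inQuotes;
    }

    // Two physical lines that are each malformed on their own.
    static final String LINE_ONE = "1,\"missing close,x";
    static final String LINE_TWO = "2,stray\" quote,y";
}
```

LINE_ONE alone leaves a quote open, but LINE_ONE followed by LINE_TWO looks balanced to a parity check, so only later field-level parsing or manual inspection would notice the problem.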


> The CSV file source needs to be a little more robust when handling multi-line CSV files
> ---------------------------------------------------------------------------------------
>
>                 Key: CRUNCH-414
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-414
>             Project: Crunch
>          Issue Type: Improvement
>          Components: Core
>    Affects Versions: 0.8.3
>            Reporter: mac champion
>            Assignee: mac champion
>            Priority: Minor
>              Labels: csv, csvparser
>             Fix For: 0.8.4
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Brandon Inman recently reported an undesirable behavior in the CSV file source group
> of files. Currently, the CSVLineReader, if reading a malformed CSV file, can enter a state
> where it is perpetually waiting for an end-quote character. As he put it, "Malformed files
> are malformed files and should probably fail in some regard, but a hang is obviously undesirable."
>
> Essentially, the CSVLineReader needs to be tweaked in such a way that an informative
> exception is thrown after some threshold is reached, instead of basically just hanging.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
