crunch-dev mailing list archives

From "Brandon Inman (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CRUNCH-414) The CSV file source needs to be a little more robust when handling multi-line CSV files
Date Thu, 26 Jun 2014 15:11:25 GMT

    [ https://issues.apache.org/jira/browse/CRUNCH-414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14044739#comment-14044739 ]

Brandon Inman commented on CRUNCH-414:
--------------------------------------

I feel like having two outputs from the do/while loop (i.e., the string buffer and the inputText,
where only one is a valid output at a time) adds extra complexity to the code, and the CSV
code is already necessarily a little complicated to begin with.  Unless there is a demonstrable
performance concern with using the string buffer for the cases that don't need it (understanding
that records without embedded newlines may not be the majority of cases), I would make the
string buffer the source of truth. 

Also, is the last part of the record accounted for in this algorithm?  That is, after we are
out of the quoted section, currentlyInQuotes should be false and therefore the final value
of inputText gets lost, rather than appended to the string buffer.
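
To make the suggestion above concrete, here is a minimal sketch of a do/while loop in which
the string buffer is the only output. It is not the actual CSVLineReader code or the patch
under review: the class name BufferedRecordSketch, the method readRecord, and the use of
BufferedReader are all assumptions made for illustration. Only inputText and
currentlyInQuotes are names taken from the discussion. Because every line is appended
before the quote state is checked, the final line of a multi-line record (the one that
closes the quoted section) cannot be dropped.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

/** Hypothetical illustration only; not the Crunch CSVLineReader. */
final class BufferedRecordSketch {
    private static final char QUOTE = '"';

    /**
     * Reads one logical CSV record, which may span several physical lines
     * when a quoted field contains newlines. Returns null at end of input.
     */
    static String readRecord(BufferedReader in) throws IOException {
        StringBuilder record = new StringBuilder(); // single source of truth
        boolean currentlyInQuotes = false;
        boolean readAnything = false;
        String inputText;
        do {
            inputText = in.readLine();
            if (inputText == null) {
                break; // end of input, possibly inside an unclosed quote
            }
            if (readAnything) {
                record.append('\n'); // restore the newline readLine() stripped
            }
            readAnything = true;
            // Append unconditionally, so the closing line is never lost.
            record.append(inputText);
            // Each quote character toggles the state; an escaped quote ("")
            // toggles twice and therefore leaves the state unchanged.
            for (int i = 0; i < inputText.length(); i++) {
                if (inputText.charAt(i) == QUOTE) {
                    currentlyInQuotes = !currentlyInQuotes;
                }
            }
        } while (currentlyInQuotes);
        return readAnything ? record.toString() : null;
    }
}
```

The cost is one extra copy for single-line records, which is the performance question
raised above. This sketch does not address malformed input: an unclosed quote simply
consumes the rest of the input.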

> The CSV file source needs to be a little more robust when handling multi-line CSV files
> ---------------------------------------------------------------------------------------
>
>                 Key: CRUNCH-414
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-414
>             Project: Crunch
>          Issue Type: Improvement
>          Components: Core
>    Affects Versions: 0.8.3
>            Reporter: mac champion
>            Assignee: mac champion
>            Priority: Minor
>              Labels: csv, csvparser
>             Fix For: 0.8.4
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Brandon Inman recently reported an undesirable behavior in the CSV file source group
> of files. Currently, the CSVLineReader, if reading a malformed CSV file, can enter a state
> where it is perpetually waiting for an end-quote character. As he put it, "Malformed files
> are malformed files and should probably fail in some regard, but a hang is obviously undesirable."

> Essentially, the CSVLineReader needs to be tweaked in such a way that an informative
> exception is thrown after some threshold is reached, instead of basically just hanging.
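
One way the behavior requested in the issue description could look is sketched below. The
issue does not say what the threshold should measure, so counting physical lines per record
is an assumption here, as are the names BoundedRecordSketch, readRecord, and maxLines and
the choice of IOException. None of this is taken from the Crunch code base.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

/** Hypothetical illustration only; not the Crunch CSVLineReader. */
final class BoundedRecordSketch {
    private static final char QUOTE = '"';

    /**
     * Reads one logical CSV record, failing with an informative exception
     * if the record is still inside a quoted field after maxLines physical
     * lines or at end of input. Returns null when no input remains.
     */
    static String readRecord(BufferedReader in, int maxLines) throws IOException {
        StringBuilder record = new StringBuilder();
        boolean inQuotes = false;
        int linesRead = 0;
        String line;
        while ((line = in.readLine()) != null) {
            if (linesRead > 0) {
                record.append('\n'); // restore the newline readLine() stripped
            }
            record.append(line);
            linesRead++;
            for (int i = 0; i < line.length(); i++) {
                if (line.charAt(i) == QUOTE) {
                    inQuotes = !inQuotes;
                }
            }
            if (!inQuotes) {
                return record.toString(); // record is complete
            }
            if (linesRead >= maxLines) {
                // Assumed threshold: give up rather than keep waiting
                // for an end-quote that may never arrive.
                throw new IOException("Unterminated quoted field: no closing quote after "
                        + linesRead + " lines; the CSV input is probably malformed");
            }
        }
        if (linesRead > 0) {
            // Input ended while still inside a quoted field.
            throw new IOException("Reached end of input inside a quoted field");
        }
        return null;
    }
}
```

A byte-count limit would work the same way and may fit better when individual lines can
be very long. Either way, the reader fails fast with a message that points at the
malformed record.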



--
This message was sent by Atlassian JIRA
(v6.2#6252)
