crunch-dev mailing list archives

From "Micah Whitacre (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CRUNCH-362) Add a CSV File Source
Date Fri, 04 Apr 2014 02:29:14 GMT

    [ https://issues.apache.org/jira/browse/CRUNCH-362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13959566#comment-13959566 ]

Micah Whitacre commented on CRUNCH-362:
---------------------------------------

Thanks for the patch, Mac.

I'm still reviewing it, but here are a few things to fix up:

* For consistency with other sources, CSVFileSource should support List<Path> as well.
* CSVLineReader checks whether the escape character matches the quote character. We should
do that check when we construct the source rather than in the reader, so the consumer gets
the feedback before the job is submitted to the cluster.
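
A rough sketch of both points, not taken from the patch. CsvSourceSketch is a hypothetical
stand-in for CSVFileSource, java.nio.file.Path stands in for Hadoop's Path so the example
compiles on its own, and it assumes the rule being enforced is that the quote and escape
characters must differ:

```java
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for CSVFileSource; names and rules here are assumptions.
public class CsvSourceSketch {
    private final List<Path> paths;
    private final char quote;
    private final char escape;

    // Single-path convenience constructor delegates to the List<Path> form.
    public CsvSourceSketch(Path path, char quote, char escape) {
        this(Collections.singletonList(path), quote, escape);
    }

    public CsvSourceSketch(List<Path> paths, char quote, char escape) {
        if (paths == null || paths.isEmpty()) {
            throw new IllegalArgumentException("At least one input path is required");
        }
        // Assumed rule: quote and escape must differ. Checking here means the
        // caller sees the failure while building the pipeline, before any job
        // is submitted to the cluster.
        if (quote == escape) {
            throw new IllegalArgumentException(
                "The quote and escape characters must be different");
        }
        this.paths = Collections.unmodifiableList(new ArrayList<>(paths));
        this.quote = quote;
        this.escape = escape;
    }

    public List<Path> getPaths() {
        return paths;
    }

    public char getQuote() {
        return quote;
    }

    public char getEscape() {
        return escape;
    }
}
```

The reader could keep its own check as a safety net, but with this arrangement a bad
configuration fails on the client when the source is constructed.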

> Add a CSV File Source
> ---------------------
>
>                 Key: CRUNCH-362
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-362
>             Project: Crunch
>          Issue Type: Improvement
>          Components: Core
>    Affects Versions: 0.9.0
>            Reporter: mac champion
>            Assignee: mac champion
>            Priority: Trivial
>              Labels: csv, csvparser, inputformat
>             Fix For: 0.10.0
>
>         Attachments: 0001-CRUNCH-362-Add-CSVFileSource.patch
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> CSV files can be unpredictable. Among other quirks, it is possible for a single CSV record
> to span multiple lines in a file. In cases like these, TextFileSource is not effective and
> NLineFileSource is not flexible enough.
> The result of this JIRA should be a CSVFileSource which, at minimum, should be able to
> deal with multiple-line CSV records.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
