hadoop-mapreduce-issues mailing list archives

From "Harsh J (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-2208) Flexible CSV text parser InputFormat
Date Thu, 27 Oct 2011 04:23:33 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13136765#comment-13136765 ]

Harsh J commented on MAPREDUCE-2208:
------------------------------------

I'd suggest reusing OpenCSV instead, if that is possible. I believe its
license is compatible, and it is well maintained.

On Thursday, October 27, 2011, Maksym Kovalenko (Commented) (JIRA) wrote:
[ https://issues.apache.org/jira/browse/MAPREDUCE-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13136680#comment-13136680 ]
uses comma as a delimiter and happen to have comma in one of the values, for
example:
7 columns for the above case instead of 3.
In this case according to CSV escaping rules it has to be escaped by another
double quote, for example:
instead of patterns, one had to provide delimiter character (comma by
default) and quote character (double quote by default). Then I and other
users won't have to struggle with possible regex patterns (see my questions
above, I'm still curious if you can come up with one).
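The delimiter-plus-quote-character approach described above can be sketched in plain Java. This is only an illustration, not the attached CSVTextInputFormat and not OpenCSV: the class name `CsvLineSplitter`, the method `split`, and the sample line are hypothetical. It assumes a field may be wrapped in the quote character and that a doubled quote inside a quoted field stands for one literal quote.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of the attached patch.
final class CsvLineSplitter {

    /** Splits one line on 'delimiter', honoring 'quote' and doubled-quote escapes. */
    static List<String> split(String line, char delimiter, char quote) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;

        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == quote) {
                    if (i + 1 < line.length() && line.charAt(i + 1) == quote) {
                        current.append(quote); // doubled quote becomes one literal quote
                        i++;                   // skip the second quote of the pair
                    } else {
                        inQuotes = false;      // closing quote
                    }
                } else {
                    current.append(c);         // delimiters inside quotes are data
                }
            } else if (c == quote) {
                inQuotes = true;               // opening quote
            } else if (c == delimiter) {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());        // last field has no trailing delimiter
        return fields;
    }

    public static void main(String[] args) {
        // Three fields: the second contains a comma, the third contains quotes.
        String line = "1,\"Smith, John\",\"said \"\"hi\"\"\"";
        for (String field : split(line, ',', '"')) {
            System.out.println(field);
        }
    }
}
```

With comma and double quote as the two parameters, a quoted value containing a comma stays in one field, so no regex has to be written by the user.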
any regexes that you need if necessary (if you want to stick to the current
implementation). By the way, the implementation is currently fragile where it
prepends "\\" to the user-provided regex. This will break when the
user-supplied pattern itself starts with "\\".
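A small Java sketch of the fragility mentioned above. How the attached implementation actually builds its pattern is not shown in this message, so the `naive` method below is only an assumption about what "prepend a backslash" looks like, and the class `DelimiterPatterns` is hypothetical. `Pattern.quote` is the standard java.util.regex call for treating a user string as a literal.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical illustration; not the code from the attached patch.
final class DelimiterPatterns {

    // Assumed naive approach: escape the user's input by prepending one backslash.
    static Pattern naive(String userInput) {
        return Pattern.compile("\\" + userInput);
    }

    // Alternative: treat the user's input as a literal delimiter, never as a regex.
    static Pattern literal(String userInput) {
        return Pattern.compile(Pattern.quote(userInput));
    }

    public static void main(String[] args) {
        // A bare "|" works with the naive approach: the regex becomes \| (a literal pipe).
        System.out.println(Arrays.toString(naive("|").split("a|b")));

        // If the user already escaped it as \|, the regex becomes \\| which means
        // "a literal backslash OR the empty string", so it no longer splits on the pipe.
        System.out.println(Arrays.toString(naive("\\|").split("a|b")));

        // Quoting the input matches exactly the characters the user typed.
        System.out.println(Arrays.toString(literal("|").split("a|b")));
    }
}
```

Quoting changes the contract: the user supplies a literal delimiter rather than a pattern, which is in line with the suggestion to accept a delimiter character instead of a regex.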

-- 
Harsh J

                
> Flexible CSV text parser InputFormat
> ------------------------------------
>
>                 Key: MAPREDUCE-2208
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2208
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>            Reporter: Lance Norskog
>            Priority: Trivial
>         Attachments: CSVTextInputFormat.java, TestCSVTextFormat.java
>
>
> CSVTextInputFormat is a configurable CSV parser tuned to most of the CSV-style datasets
> I've found. The Hadoop samples I've seen all use FileInputFormat and Mapper<LongWritable,Text>.
> They drop the LongWritable key and parse the Text value as a CSV line. But they are all
> custom-coded for the format.
> CSVTextInputFormat takes any CSV-encoded file and rearranges the fields into the format
> required by a Mapper. You can drop fields and rearrange them. There is also a random sampling
> option to make training/test runs easier.
> Attached are CSVTextInputFormat.java and a unit test for it. Both go into org.apache.hadoop.mapreduce.lib.input
> under src/java and test/mapred/src.
> This is compiled against hadoop-0.0.20.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
