hadoop-mapreduce-issues mailing list archives

From "Maksym Kovalenko (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-2208) Flexible CSV text parser InputFormat
Date Thu, 27 Oct 2011 01:39:33 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13136680#comment-13136680
] 

Maksym Kovalenko commented on MAPREDUCE-2208:
---------------------------------------------

So what regex would one need to specify to parse "normal" CSV that uses a comma as the delimiter
and happens to have a comma in one of the values? For example:

value1,value2,"more,complex,with,commas,value3"

Just providing "," as pattern1 will no longer work, as it will produce 7 columns for the
above case instead of 3.
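
To make the column count concrete, here is a minimal sketch in plain Java. It assumes that a pattern of "," behaves like a plain regex split on every comma; the class and method names are hypothetical and are not taken from the attached CSVTextInputFormat.java.

```java
// Hypothetical demo class; not part of CSVTextInputFormat.
final class NaiveSplitDemo {

    // Splits on every comma and pays no attention to quoting.
    static String[] naiveSplit(String line) {
        return line.split(",");
    }

    public static void main(String[] args) {
        String line = "value1,value2,\"more,complex,with,commas,value3\"";
        // The quoted third value is cut at each of its four commas.
        System.out.println(naiveSplit(line).length); // prints 7
    }
}
```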

Also consider the use case where a value contains a double quote. According to CSV escaping
rules, it has to be escaped by another double quote, for example:

column1,"thank you, ""User"" for the report, again, thank you",column3

Considering the above two cases, what value should I provide for pattern1?
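
For reference, one candidate is a lookahead regex that splits only on commas followed by an even number of double quotes. The sketch below is plain Java with hypothetical names; whether CSVTextInputFormat would accept such a pattern as pattern1 is an assumption that has not been checked here. Note that it only splits: the surrounding quotes and the doubled "" escapes are left in the field values.

```java
// Hypothetical demo class; not part of CSVTextInputFormat.
final class QuoteAwareSplit {

    // Matches a comma only when the rest of the line contains an even
    // number of double quotes, i.e. the comma is outside a quoted value.
    static final String COMMA_OUTSIDE_QUOTES = ",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)";

    static String[] split(String line) {
        // Limit -1 keeps trailing empty fields.
        return line.split(COMMA_OUTSIDE_QUOTES, -1);
    }

    public static void main(String[] args) {
        String a = "value1,value2,\"more,complex,with,commas,value3\"";
        String b = "column1,\"thank you, \"\"User\"\" for the report, again, thank you\",column3";
        System.out.println(split(a).length); // prints 3
        System.out.println(split(b).length); // prints 3
    }
}
```

The pattern is hard to read and still leaves unquoting and unescaping to the caller, which is part of the difficulty with a regex-only configuration.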

I think configuring CSVTextInputFormat would be more natural if, instead of patterns,
one had to provide a delimiter character (comma by default) and a quote character (double quote
by default). Then I and other users wouldn't have to struggle with possible regex patterns (see
my questions above; I'm still curious whether you can come up with one).
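
As an illustration of the delimiter-plus-quote idea, here is a minimal sketch of a single-line parser driven by those two characters. The class name, method name, and signature are hypothetical and are not part of the attached code. It handles one line at a time and does not cover quoted values that span multiple lines.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; not part of CSVTextInputFormat.
final class SimpleCsvLineParser {

    // Parses one CSV line using a delimiter character and a quote character.
    // Inside a quoted value, a doubled quote stands for one literal quote.
    static List<String> parse(String line, char delimiter, char quote) {
        List<String> fields = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c != quote) {
                    current.append(c);
                } else if (i + 1 < line.length() && line.charAt(i + 1) == quote) {
                    current.append(quote); // escaped quote
                    i++;                   // skip the second quote of the pair
                } else {
                    inQuotes = false;      // closing quote
                }
            } else if (c == quote) {
                inQuotes = true;           // opening quote
            } else if (c == delimiter) {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());    // last field
        return fields;
    }

    public static void main(String[] args) {
        String line = "column1,\"thank you, \"\"User\"\" for the report, again, thank you\",column3";
        for (String field : parse(line, ',', '"')) {
            System.out.println(field);
        }
    }
}
```

With the defaults (comma and double quote), both example lines above come out as 3 fields, with the quotes removed and the doubled quotes unescaped.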

Another benefit is that you can build any regexes you need from the delimiter and quote
characters (if you want to stick to the current implementation). By the way, right now the
implementation is somewhat fragile: it prepends "\\" to the user-provided regex. This will
break when the user-supplied pattern itself starts with "\\".
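
A minimal sketch of that fragility, assuming the prepending amounts to putting one regex backslash in front of the user's pattern (the exact code in the attachment may differ, and the names below are hypothetical). It also shows java.util.regex.Pattern.quote as one way to turn a literal delimiter into a regex without manual escaping.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of CSVTextInputFormat.
final class EscapeDemo {

    // Assumed behaviour: a single backslash is put in front of the user pattern.
    static String prependBackslash(String userPattern) {
        return "\\" + userPattern;
    }

    // Alternative: quote the delimiter so it is always matched literally.
    static String quoteLiteral(String delimiter) {
        return Pattern.quote(delimiter);
    }

    public static void main(String[] args) {
        // "|" becomes "\|", a literal pipe, so the split works.
        System.out.println("a|b".split(prependBackslash("|")).length);      // prints 2

        // The user pattern "\t" (regex for a tab) becomes "\\t", which matches
        // a literal backslash followed by 't', so the tab no longer splits.
        System.out.println("a\tb".split(prependBackslash("\\t")).length);   // prints 1

        // Quoting the literal delimiter needs no manual escaping.
        System.out.println("a|b".split(quoteLiteral("|")).length);          // prints 2
    }
}
```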
                
> Flexible CSV text parser InputFormat
> ------------------------------------
>
>                 Key: MAPREDUCE-2208
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2208
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>            Reporter: Lance Norskog
>            Priority: Trivial
>         Attachments: CSVTextInputFormat.java, TestCSVTextFormat.java
>
>
> CSVTextInputFormat is a configurable CSV parser tuned to most of the csv-style datasets
> I've found. The Hadoop samples I've seen all use FileInputFormat and Mapper<LongWritable,Text>.
> They drop the LongWritable key and parse the Text value as a CSV line. But they are all
> custom-coded for the format.
> CSVTextInputFormat takes any csv-encoded file and rearranges the fields into the format
> required by a Mapper. You can drop fields and rearrange them. There is also a random sampling
> option to make training/test runs easier.
> Attached are CSVTextInputFormat.java and a unit test for it. Both go into
> org.apache.hadoop.mapreduce.lib.input under src/java and test/mapred/src.
> This is compiled against hadoop-0.0.20.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
