crunch-dev mailing list archives

From "Gabriel Reid (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CRUNCH-632) Add compression support for CSVFileSource
Date Wed, 11 Jan 2017 07:31:58 GMT

    [ https://issues.apache.org/jira/browse/CRUNCH-632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15817513#comment-15817513 ]

Gabriel Reid commented on CRUNCH-632:
-------------------------------------

Just to clarify on the combination of compression and text files: you're right that they aren't
typically splittable (assuming gzip compression is used), but some codecs do support input
splits. bzip2 is one example; Snappy does so only inside a container format such as a SequenceFile.

The [o.a.h.mapreduce.lib.input.TextInputFormat#isSplitable|http://grepcode.com/file/repo1.maven.org/maven2/com.ning/metrics.action/0.2.7/org/apache/hadoop/mapreduce/lib/input/TextInputFormat.java#45]
method can be overridden so that we can decide what to do about splitting.
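
As a rough sketch of that decision, the plain-Java stand-in below mirrors the shape of the check: an uncompressed file is splittable, and a compressed one is splittable only if its codec allows it. The SplitDecision class, the Codec enum and the suffix table are hypothetical and exist only for illustration; a real override would look the codec up through Hadoop's CompressionCodecFactory and test whether it is a SplittableCompressionCodec, rather than inspect file suffixes itself.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for an isSplitable-style check; not Hadoop or Crunch code. */
final class SplitDecision {

    /** Illustrative codecs and whether each can be read from an arbitrary offset. */
    enum Codec {
        GZIP(false),  // a gzip stream has to be read from its start
        BZIP2(true);  // bzip2 is block-oriented, so a reader can resync mid-file

        final boolean splittable;

        Codec(boolean splittable) {
            this.splittable = splittable;
        }
    }

    // Assumed suffix-to-codec table, for illustration only.
    private static final Map<String, Codec> BY_SUFFIX = new LinkedHashMap<>();
    static {
        BY_SUFFIX.put(".gz", Codec.GZIP);
        BY_SUFFIX.put(".bz2", Codec.BZIP2);
    }

    /** Returns the codec implied by the file name, or null if the file looks uncompressed. */
    static Codec codecFor(String fileName) {
        for (Map.Entry<String, Codec> entry : BY_SUFFIX.entrySet()) {
            if (fileName.endsWith(entry.getKey())) {
                return entry.getValue();
            }
        }
        return null;
    }

    /** Uncompressed files split freely; compressed ones only if the codec allows it. */
    static boolean isSplitable(String fileName) {
        Codec codec = codecFor(fileName);
        return codec == null || codec.splittable;
    }

    public static void main(String[] args) {
        for (String name : new String[] {"data.csv", "data.csv.gz", "data.csv.bz2"}) {
            System.out.println(name + " -> splittable: " + isSplitable(name));
        }
    }
}
```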

> Add compression support for CSVFileSource
> -----------------------------------------
>
>                 Key: CRUNCH-632
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-632
>             Project: Crunch
>          Issue Type: Improvement
>            Reporter: Jim McStanton
>            Priority: Minor
>
> Currently CSVFileSource does not support decompressing files before reading them, and
> simply opens the file and starts reading the contents: https://github.com/apache/crunch/blob/6280983179e9c690af69c2bf0e296b054122d724/crunch-core/src/main/java/org/apache/crunch/io/text/csv/CSVRecordReader.java#L127.

> This source would more closely match TextFileSource if this support were added. The {{LineRecordReader}}
> supports this behavior [here|http://grepcode.com/file/repo1.maven.org/maven2/org.apache.hadoop/hadoop-mapreduce-client-core/2.7.1/org/apache/hadoop/mapreduce/lib/input/LineRecordReader.java?av=f#87].
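
A minimal sketch of the requested behavior, using only the JDK rather than Hadoop: choose a decompressing stream from the file name before handing the bytes to the line reader. The CompressedLineReader class and its ".gz" suffix check are hypothetical; LineRecordReader instead looks up a codec through CompressionCodecFactory, and a CSV reader would presumably follow the same approach.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical illustration of "decompress before reading"; not Crunch code. */
final class CompressedLineReader {

    /** Opens the file, wrapping it in a gzip decoder when the name ends in ".gz". */
    static InputStream open(Path file) throws IOException {
        InputStream raw = Files.newInputStream(file);
        if (file.getFileName().toString().endsWith(".gz")) {
            try {
                return new GZIPInputStream(raw);
            } catch (IOException e) {
                raw.close(); // do not leak the handle if the gzip header is invalid
                throw e;
            }
        }
        return raw;
    }

    /** Reads every line of the (possibly compressed) file as UTF-8 text. */
    static List<String> readLines(Path file) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(open(file), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Write a small gzip-compressed CSV to a temp file, then read it back.
        Path tmp = Files.createTempFile("sample", ".csv.gz");
        try (OutputStream out = new GZIPOutputStream(Files.newOutputStream(tmp))) {
            out.write("id,name\n1,crunch\n".getBytes(StandardCharsets.UTF_8));
        }
        System.out.println(readLines(tmp));
        Files.delete(tmp);
    }
}
```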




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
