crunch-dev mailing list archives

From "Gabriel Reid (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CRUNCH-632) Add compression support for CSVFileSource
Date Thu, 12 Jan 2017 07:31:52 GMT

    [ https://issues.apache.org/jira/browse/CRUNCH-632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15820377#comment-15820377 ]

Gabriel Reid commented on CRUNCH-632:
-------------------------------------

Yep, you're right.

Looking into this in more detail, it appears that only one compression codec (BZip2Codec)
allows splits and reading from an arbitrary point in a file. Given the extra effort required
to make this work (see CompressedSplitLineReader), and particularly considering that bzip2
doesn't seem to be used all that much, supporting splittable compressed input doesn't seem
worth the extra work.
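
To illustrate why most codecs cannot be read from an arbitrary point, here is a minimal
sketch using only the JDK's gzip classes (not Hadoop's codec API). The class and method
names are made up for illustration. It compresses a small CSV payload in memory, then
tries to decompress it once from the start and once from a byte offset just past the
gzip header, which is roughly what a reader handed a mid-file split would have to do.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo class; not part of Crunch or Hadoop.
public class GzipSplitDemo {

    // Compress a string into an in-memory gzip stream.
    static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bytes)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return bytes.toByteArray();
    }

    // Try to decompress starting at the given byte offset, as a reader
    // assigned a split beginning at that offset would have to do.
    static boolean canReadFrom(byte[] gzipped, int offset) {
        try (GZIPInputStream in = new GZIPInputStream(
                new ByteArrayInputStream(gzipped, offset, gzipped.length - offset))) {
            while (in.read() != -1) {
                // Drain the stream; we only care whether decoding succeeds.
            }
            return true;
        } catch (IOException e) {
            // Without the header at the start of the stream, decoding fails.
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = gzip("id,name\n1,alice\n2,bob\n");
        System.out.println("from start:     " + canReadFrom(data, 0));
        System.out.println("from offset 10: " + canReadFrom(data, 10));
    }
}
```

Reading from offset 0 should succeed, while reading from offset 10 is expected to fail
because the decoder no longer sees the gzip header. A splittable codec such as bzip2
avoids this by having block boundaries a reader can resynchronize on.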

> Add compression support for CSVFileSource
> -----------------------------------------
>
>                 Key: CRUNCH-632
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-632
>             Project: Crunch
>          Issue Type: Improvement
>            Reporter: Jim McStanton
>            Assignee: Micah Whitacre
>            Priority: Minor
>         Attachments: CRUNCH-632.patch, CRUNCH-632b.patch
>
>
> Currently CSVFileSource does not support decompressing files before reading them, and
> simply opens the file and starts reading the contents: https://github.com/apache/crunch/blob/6280983179e9c690af69c2bf0e296b054122d724/crunch-core/src/main/java/org/apache/crunch/io/text/csv/CSVRecordReader.java#L127.
>
> This source would more closely match TextFileSource if this support were added. The {{LineRecordReader}}
> supports this behavior [here|http://grepcode.com/file/repo1.maven.org/maven2/org.apache.hadoop/hadoop-mapreduce-client-core/2.7.1/org/apache/hadoop/mapreduce/lib/input/LineRecordReader.java?av=f#87].
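
The behavior requested above can be sketched as follows. This is a JDK-only stand-in,
not the attached patch and not Hadoop code: as far as I understand, LineRecordReader
looks up a codec for the input path through CompressionCodecFactory and wraps the raw
stream with it when one is found. The sketch mimics that idea with a hypothetical
helper that checks for a ".gz" suffix and wraps the stream in a GZIPInputStream; the
class name, method names, and the suffix check are all assumptions for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;

// Hypothetical stand-in for codec-aware file opening; not Crunch or Hadoop code.
public class CompressedLineReader {

    // Wrap the raw stream in a decompressor when the file name indicates
    // gzip; otherwise return the raw stream unchanged.
    static InputStream open(Path file) throws IOException {
        InputStream raw = Files.newInputStream(file);
        if (!file.getFileName().toString().endsWith(".gz")) {
            return raw;
        }
        try {
            return new GZIPInputStream(raw);
        } catch (IOException e) {
            raw.close(); // do not leak the file handle on a bad header
            throw e;
        }
    }

    // Read every line of the (possibly compressed) file as UTF-8 text.
    static List<String> readLines(Path file) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(open(file), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }
}
```

With this shape the record-reading logic is unchanged; only the way the stream is
opened differs. Since a gzip stream has to be read from its beginning, a compressed
input file would be handled as a single split.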




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
