hadoop-mapreduce-issues mailing list archives

From "Tom White (JIRA)" <j...@apache.org>
Subject [jira] Commented: (MAPREDUCE-1017) Compression and output splitting for Sqoop
Date Thu, 15 Oct 2009 11:06:31 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-1017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12766013#action_12766013 ]

Tom White commented on MAPREDUCE-1017:
--------------------------------------

Overall this looks like a good addition. A few comments:
* HdfsSplitOutputStream doesn't seem to be HDFS-specific, so it should be renamed (to BlockSplitOutputStream?).
In principle it could be used with another block-based filesystem, like S3FileSystem.
* HdfsSplitOutputStream uses CountingOutputStream to keep track of how many bytes have been
written. Could you use FSDataOutputStream#getPos() for this? (Also, there's a CountingOutputStream
in Apache Commons IO, which we already depend on.)
* It would be good to support more than just gzip compression in HdfsSplitOutputStream. The
machinery in org.apache.hadoop.io.compress should make this relatively straightforward.
* It would be good to have a unit test for HdfsSplitOutputStream.
* Formatting nits: there are a few redundant imports, and some lines are longer than 80 characters.
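
For illustration only, here is a minimal, self-contained sketch of the splitting idea, using just java.util.zip. The class name BlockSplitSketch and its roundTrip helper are hypothetical and are not taken from the patch. As simplifying assumptions, it counts uncompressed bytes itself (rather than calling FSDataOutputStream#getPos()), writes to in-memory buffers instead of files on a block-based filesystem, hard-codes gzip instead of going through org.apache.hadoop.io.compress, and only rolls over after a newline so that no record straddles two splits.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical sketch: writes each split as its own gzip stream. */
class BlockSplitSketch extends OutputStream {
    private final long splitSize;  // target uncompressed bytes per split
    private final List<ByteArrayOutputStream> sinks = new ArrayList<>();
    private OutputStream current;  // gzip stream for the current split
    private long written;          // uncompressed bytes in the current split
    private boolean atRecordEnd;   // true if the last byte was a newline

    BlockSplitSketch(long splitSize) throws IOException {
        if (splitSize <= 0) {
            throw new IllegalArgumentException("splitSize must be positive");
        }
        this.splitSize = splitSize;
        openNextSplit();
    }

    /** Starts a new in-memory sink (a stand-in for a new output file). */
    private void openNextSplit() throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        sinks.add(sink);
        current = new GZIPOutputStream(sink);
        written = 0;
        atRecordEnd = false;
    }

    @Override
    public void write(int b) throws IOException {
        // Roll over lazily, and only on a record boundary, so that no
        // empty trailing split is created and no record is cut in half.
        if (written >= splitSize && atRecordEnd) {
            current.close();
            openNextSplit();
        }
        current.write(b);
        written++;
        atRecordEnd = ((b & 0xFF) == '\n');
    }

    @Override
    public void close() throws IOException {
        current.close();
    }

    /** Writes text through the splitter, then decompresses each split. */
    static List<String> roundTrip(String text, long splitSize) {
        try {
            BlockSplitSketch out = new BlockSplitSketch(splitSize);
            out.write(text.getBytes(StandardCharsets.UTF_8));
            out.close();
            List<String> parts = new ArrayList<>();
            for (ByteArrayOutputStream sink : out.sinks) {
                parts.add(gunzip(sink.toByteArray()));
            }
            return parts;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static String gunzip(byte[] data) throws IOException {
        ByteArrayOutputStream plain = new ByteArrayOutputStream();
        try (GZIPInputStream in =
                 new GZIPInputStream(new ByteArrayInputStream(data))) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                plain.write(buf, 0, n);
            }
        }
        return new String(plain.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

With a split size of 6, BlockSplitSketch.roundTrip("row1\nrow2\nrow3\n", 6) produces two independently decompressible splits, "row1\nrow2\n" and "row3\n". The threshold is checked against the uncompressed count here; a real implementation would have to decide whether to measure uncompressed or compressed bytes against the filesystem block size.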

> Compression and output splitting for Sqoop
> ------------------------------------------
>
>                 Key: MAPREDUCE-1017
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1017
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>          Components: contrib/sqoop
>            Reporter: Aaron Kimball
>            Assignee: Aaron Kimball
>         Attachments: MAPREDUCE-1017.2.patch, MAPREDUCE-1017.patch
>
>
> Sqoop "direct mode" writing will generate a single large text file in HDFS. It is important
> to be able to compress this data before it reaches HDFS. Due to the difficulty in splitting
> compressed files in HDFS for use by MapReduce jobs, data should also be split at compression
> time.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
