hadoop-mapreduce-issues mailing list archives

From "Aaron Kimball (JIRA)" <j...@apache.org>
Subject [jira] Commented: (MAPREDUCE-1017) Compression and output splitting for Sqoop
Date Thu, 15 Oct 2009 19:00:32 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-1017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12766174#action_12766174 ]

Aaron Kimball commented on MAPREDUCE-1017:
------------------------------------------

Thanks for the review. Some responses:

* You're correct about the name of HdfsSplitOutputStream. Will change.
* Thanks for the pointer about existing counting streams.
* I agree that this should eventually support multiple compression codecs, but the burden
is on the application to select the correct codec based on the intended file extension. That
would add even more code to this ticket; I'll move toward support for additional codecs (bz2,
etc.) in a subsequent JIRA.
* Do you think a separate test is necessary for HdfsSplitOutputStream? This is tested through
TestSplittableBufferedWriter. The SplittableBufferedWriter and SplitOutputStream classes are
pretty tightly coupled -- SplittableBufferedWriter does virtually nothing but wrap the OutputStream
in a BufferedWriter. {{testSplittingTextFile()}}, for example, only passes because of {{HdfsSplitOutputStream.openNextFile()}}.
* I'll clean up the formatting a bit in the next patch.

Will submit a new patch soon.
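
To illustrate the coupling described above, here is a minimal sketch of a writer that does nothing but wrap a splitting stream. It is not the code from the patch: the class and method names (SplitStreamSketch, SplitOutputStream, wouldLikeToSplit, allowSplit, writeRecords) are hypothetical, the internals of openNextFile() are assumed, and in-memory buffers stand in for HDFS files so the example runs without Hadoop.

```java
import java.io.BufferedWriter;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SplitStreamSketch {

  /** Hypothetical stand-in for a split output stream; buffers replace HDFS files. */
  static class SplitOutputStream extends OutputStream {
    private final long cutoffBytes;
    private final List<ByteArrayOutputStream> files = new ArrayList<>();
    private ByteArrayOutputStream current;
    private long written;

    SplitOutputStream(long cutoffBytes) {
      this.cutoffBytes = cutoffBytes;
      openNextFile();
    }

    /** Starts a new in-memory "file" and resets the byte count (assumed behavior). */
    private void openNextFile() {
      current = new ByteArrayOutputStream();
      files.add(current);
      written = 0;
    }

    /** True once the current file has reached the cutoff. */
    boolean wouldLikeToSplit() {
      return written >= cutoffBytes;
    }

    /** Called by the writer at a record boundary; rolls over only if the cutoff was hit. */
    void allowSplit() {
      if (wouldLikeToSplit()) {
        openNextFile();
      }
    }

    @Override
    public void write(int b) {
      current.write(b);
      written++;
    }

    @Override
    public void write(byte[] b, int off, int len) {
      current.write(b, off, len);
      written += len;
    }
  }

  /** Writes one record per line and returns the text of each resulting file. */
  static List<String> writeRecords(List<String> records, long cutoffBytes) {
    SplitOutputStream out = new SplitOutputStream(cutoffBytes);
    try (BufferedWriter writer =
        new BufferedWriter(new OutputStreamWriter(out, StandardCharsets.UTF_8))) {
      for (String record : records) {
        out.allowSplit();   // split only between records, never inside one
        writer.write(record);
        writer.write('\n');
        writer.flush();     // push buffered chars down so the byte count is current
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    List<String> contents = new ArrayList<>();
    for (ByteArrayOutputStream file : out.files) {
      contents.add(new String(file.toByteArray(), StandardCharsets.UTF_8));
    }
    return contents;
  }

  public static void main(String[] args) {
    List<String> parts = writeRecords(Arrays.asList("aaaa", "bbbb", "cc"), 5);
    System.out.println(parts.size() + " files");
  }
}
```

In this sketch the writer side carries no logic of its own beyond flushing and signalling record boundaries, so a test of the writer necessarily exercises the rollover in the stream.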


> Compression and output splitting for Sqoop
> ------------------------------------------
>
>                 Key: MAPREDUCE-1017
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1017
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>          Components: contrib/sqoop
>            Reporter: Aaron Kimball
>            Assignee: Aaron Kimball
>         Attachments: MAPREDUCE-1017.2.patch, MAPREDUCE-1017.patch
>
>
> Sqoop "direct mode" writing will generate a single large text file in HDFS. It is important
> to be able to compress this data before it reaches HDFS. Due to the difficulty in splitting
> compressed files in HDFS for use by MapReduce jobs, data should also be split at compression
> time.
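
A rough sketch of what "split at compression time" can look like: each part is closed as a complete gzip stream before the next one starts, so any part can be decompressed on its own. This uses java.util.zip rather than Hadoop's codec classes, keeps the parts in memory instead of HDFS, and assumes the cutoff is measured in uncompressed bytes; the names GzipSplitSketch, compressInParts, and gunzip are hypothetical and do not come from the patch.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class GzipSplitSketch {

  /** Compresses lines into separate gzip parts, rolling over at a line boundary. */
  static List<byte[]> compressInParts(List<String> lines, long cutoffBytes) {
    List<byte[]> parts = new ArrayList<>();
    try {
      ByteArrayOutputStream sink = new ByteArrayOutputStream();
      GZIPOutputStream gz = new GZIPOutputStream(sink);
      long uncompressed = 0;  // assumption: cutoff counts bytes before compression
      for (String line : lines) {
        if (uncompressed >= cutoffBytes) {
          gz.close();                      // finishes the stream and writes the gzip trailer
          parts.add(sink.toByteArray());
          sink = new ByteArrayOutputStream();
          gz = new GZIPOutputStream(sink); // next part gets its own gzip header
          uncompressed = 0;
        }
        byte[] bytes = (line + "\n").getBytes(StandardCharsets.UTF_8);
        gz.write(bytes);
        uncompressed += bytes.length;
      }
      gz.close();
      parts.add(sink.toByteArray());
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return parts;
  }

  /** Decompresses a single part without needing any of the others. */
  static String gunzip(byte[] part) {
    try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(part))) {
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      byte[] buf = new byte[4096];
      int n;
      while ((n = in.read(buf)) != -1) {
        out.write(buf, 0, n);
      }
      return new String(out.toByteArray(), StandardCharsets.UTF_8);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

Because every part is a self-contained gzip stream, a MapReduce job could hand each file to a separate mapper instead of reading one large unsplittable archive.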

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

