hadoop-mapreduce-issues mailing list archives

From "Aaron Kimball (JIRA)" <j...@apache.org>
Subject [jira] Updated: (MAPREDUCE-1017) Compression and output splitting for Sqoop
Date Tue, 22 Sep 2009 00:35:15 GMT

     [ https://issues.apache.org/jira/browse/MAPREDUCE-1017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aaron Kimball updated MAPREDUCE-1017:
-------------------------------------

    Attachment: MAPREDUCE-1017.patch

This patch introduces two new features/arguments to Sqoop:

* Data can be compressed via {{\-\-compress}} / {{\-z}}. This will enable gzipping of text
inputs
* Users can cap the approximate size of each file written in direct mode with {{\-\-direct-split-size}},
which takes the target size in bytes. After writing a record that crosses this boundary, Sqoop
opens a new file. Because Sqoop uses buffered writers, the file size is approximate, but Sqoop
guarantees that new files are opened only on record boundaries.
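
The splitting rule above can be illustrated with a minimal, standard-library-only sketch. It is not the code in the attached patch: the class name SplittingRecordWriter, the data-NNNNN file naming, and the choice to count uncompressed UTF-8 bytes are all assumptions made for illustration. It also opens the next file lazily on the next write, so no empty trailing file is left behind.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPOutputStream;

/** Hypothetical sketch (not Sqoop's code): size-based rollover on record boundaries. */
final class SplittingRecordWriter implements AutoCloseable {
    private final Path dir;
    private final long splitSize;   // approximate max bytes per file; 0 disables splitting
    private final boolean compress; // gzip each output file when true
    private final List<Path> files = new ArrayList<>();
    private BufferedWriter out;     // null when no file is currently open
    private long bytesInFile;       // uncompressed bytes written to the current file

    SplittingRecordWriter(Path dir, long splitSize, boolean compress) {
        this.dir = dir;
        this.splitSize = splitSize;
        this.compress = compress;
    }

    /** Writes one whole record, then closes the file if the limit was crossed. */
    public void write(String record) throws IOException {
        if (out == null) {
            openNext();
        }
        out.write(record);
        out.write('\n');
        bytesInFile += record.getBytes(StandardCharsets.UTF_8).length + 1;
        // The check runs only after a complete record, so a record is never
        // split across files; a file may therefore exceed splitSize slightly.
        if (splitSize > 0 && bytesInFile > splitSize) {
            closeCurrent();
        }
    }

    /** Opens the next numbered file, wrapping it in gzip if requested. */
    private void openNext() throws IOException {
        String name = String.format("data-%05d%s", files.size(), compress ? ".gz" : "");
        Path file = dir.resolve(name);
        OutputStream raw = Files.newOutputStream(file);
        OutputStream sink = compress ? new GZIPOutputStream(raw) : raw;
        out = new BufferedWriter(new OutputStreamWriter(sink, StandardCharsets.UTF_8));
        files.add(file);
        bytesInFile = 0;
    }

    private void closeCurrent() throws IOException {
        if (out != null) {
            out.close(); // flushes the buffer and finishes the gzip stream, if any
            out = null;
        }
    }

    /** Files created so far, in the order they were opened. */
    public List<Path> files() {
        return files;
    }

    @Override
    public void close() throws IOException {
        closeCurrent();
    }
}
```

With a 10-byte limit, for example, writing the records "aaaaaaaa", "bb", and "cc" in this sketch yields two files: the second record pushes the first file past the limit, so the third record starts a new one. Because each file is a complete gzip stream when compression is on, every split can be read independently.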

The compression argument applies to non-direct-mode imports as well: Sqoop will now use a
compression codec when writing text files in a MapReduce-based import. Sqoop used to
call {{SequenceFileOutputFormat.setCompressionEnabled(true)}} by default; this will now
happen only if the user explicitly requests compression.

> Compression and output splitting for Sqoop
> ------------------------------------------
>
>                 Key: MAPREDUCE-1017
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1017
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>          Components: contrib/sqoop
>            Reporter: Aaron Kimball
>            Assignee: Aaron Kimball
>         Attachments: MAPREDUCE-1017.patch
>
>
> Sqoop "direct mode" writing will generate a single large text file in HDFS. It is important
> to be able to compress this data before it reaches HDFS. Due to the difficulty in splitting
> compressed files in HDFS for use by MapReduce jobs, data should also be split at compression
> time.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

