hadoop-common-dev mailing list archives

From "zhuweimin (JIRA)" <j...@apache.org>
Subject [jira] Issue Comment Edited: (HADOOP-3227) Implement a binary input/output format for Streaming
Date Wed, 04 Mar 2009 05:05:58 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-3227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12678522#action_12678522 ]

chinashuimin edited comment on HADOOP-3227 at 3/3/09 9:05 PM:
-----------------------------------------------------------

We created two classes, BinaryInputFormat and BinaryOutputFormat, that process standard binary files for Hadoop 0.19.1.
Supporting them required modifying the PipeMapper, PipeMapRed, and PipeReducer classes.

The attached file hadoop-3227.patch is the patch.
The attached file hadoop-0.19.1-streaming.jar is the jar file.

Usage:
$HADOOP_HOME/bin/hadoop jar contrib/streaming/hadoop-0.19.1-streaming.jar \
    -input myInputDirs \
    -output myOutputDir \
    -mapper "command1" \
    -reducer "command2" \
    -inputformat org.apache.hadoop.streaming.BinaryInputFormat \
    -outputformat org.apache.hadoop.streaming.BinaryOutputFormat

Examples:
1. The map task's input is binary, its output is text, and there is no reducer:
$HADOOP_HOME/bin/hadoop jar contrib/streaming/hadoop-0.19.1-streaming.jar \
    -input myInputDirs \
    -output myOutputDir \
    -mapper "wc -c" \
    -numReduceTasks 0 \
    -inputformat org.apache.hadoop.streaming.BinaryInputFormat

2. The map task's input is binary, its output is also binary, and there is no reducer:
$HADOOP_HOME/bin/hadoop jar contrib/streaming/hadoop-0.19.1-streaming.jar \
    -input myInputDirs \
    -output myOutputDir \
    -mapper "convert -resize 200% - -" \
    -numReduceTasks 0 \
    -inputformat org.apache.hadoop.streaming.BinaryInputFormat \
    -outputformat org.apache.hadoop.streaming.BinaryOutputFormat

Note: convert is the ImageMagick command.

3. The map task's input and output are both binary, and the reducer's input is binary but its output is text:
$HADOOP_HOME/bin/hadoop jar contrib/streaming/hadoop-0.19.1-streaming.jar \
    -input myInputDirs \
    -output myOutputDir \
    -mapper "convert -resize 200% - -" \
    -reducer "wc -c" \
    -numReduceTasks 1 \
    -inputformat org.apache.hadoop.streaming.BinaryInputFormat \
    -outputformat org.apache.hadoop.streaming.BinaryOutputFormat

4. The map task's input and output are both binary, and the reducer's input and output are both binary as well:

This case is not supported.
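
For readers who want to write their own command in place of "wc -c" in examples 1 and 3, here is a minimal sketch in Python. It assumes the binary input format hands the command raw bytes on standard input, which this comment does not spell out; the names count_bytes and BLOCK_SIZE are hypothetical and chosen only for illustration.

```python
import io
import sys

BLOCK_SIZE = 64 * 1024  # read size; an arbitrary choice for this sketch


def count_bytes(stream, block_size=BLOCK_SIZE):
    """Return the number of bytes read from a binary stream until EOF."""
    total = 0
    while True:
        block = stream.read(block_size)
        if not block:  # an empty bytes object signals EOF
            break
        total += len(block)
    return total


def main():
    # sys.stdin.buffer is the byte-oriented view of stdin; the text-mode
    # sys.stdin would try to decode the data and could alter or reject it.
    print(count_bytes(sys.stdin.buffer))


if __name__ == "__main__":
    # In-memory demonstration; a real streaming script would call main() here.
    sample = io.BytesIO(b"\x89PNG\r\n\x1a\n" + bytes(1000))
    print(count_bytes(sample))
```

Reading in fixed-size blocks keeps memory use flat regardless of input size, which matters when a single input file is a large image or archive.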

      was (Author: chinashuimin):
    The previous version of this comment had the same text, except that the patch was attached as hadoop-0.19.1-streaming-20090303.diff.
  
> Implement a binary input/output format for Streaming
> ----------------------------------------------------
>
>                 Key: HADOOP-3227
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3227
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: contrib/streaming
>    Affects Versions: 0.19.1
>            Reporter: Arun C Murthy
>            Assignee: Arun C Murthy
>         Attachments: hadoop-0.19.1-streaming.jar, hadoop-3227.patch
>
>
> Lots of streaming applications process textual data with 1 record per line and fields
> separated by a delimiter. It turns out that there is no point in using any of Hadoop's input/output
> formats since the streaming script/binary itself will parse the input and break into records
> and fields. In such cases we should provide users with a binary input/output format which
> just sends 64k (or so) blocks of data directly from HDFS to the streaming application.
> I did something very similar for Pig-Streaming (PIG-94 - BinaryStorage) which resulted
> in 300%+ speedup for scanning (identity mapper & map-only jobs) data... the parsing done
> by input/output formats in these cases were pure-overhead.
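
The description's point is that the streaming program can split records and fields itself when it receives raw blocks. A minimal Python sketch of that idea follows. It assumes newline-terminated records and tab-separated fields; the function name records_from_blocks and the 64 KiB default are illustrative choices, not part of Hadoop or of the attached patch.

```python
import io

BLOCK_SIZE = 64 * 1024  # mirrors the "64k (or so)" in the description


def records_from_blocks(stream, block_size=BLOCK_SIZE, delimiter=b"\t"):
    """Yield each newline-terminated record as a list of byte-string fields."""
    pending = b""  # tail of the previous block that has no newline yet
    while True:
        block = stream.read(block_size)
        if not block:  # EOF
            break
        lines = (pending + block).split(b"\n")
        pending = lines.pop()  # last piece may be an incomplete record
        for line in lines:
            yield line.split(delimiter)
    if pending:  # final record without a trailing newline
        yield pending.split(delimiter)


if __name__ == "__main__":
    # A tiny block size forces records to straddle block boundaries.
    data = io.BytesIO(b"a\tb\nc\td\ne\tf")
    for fields in records_from_blocks(data, block_size=3):
        print(fields)
```

The pending buffer is what makes block boundaries invisible to the caller: a record cut in half by one read is completed by the next.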

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

