hadoop-common-dev mailing list archives

From "Owen O'Malley (JIRA)" <j...@apache.org>
Subject [jira] Updated: (HADOOP-3341) make key-value separators in hadoop streaming fully configurable
Date Wed, 07 May 2008 22:53:55 GMT

     [ https://issues.apache.org/jira/browse/HADOOP-3341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Owen O'Malley updated HADOOP-3341:
----------------------------------

    Status: Open  (was: Patch Available)

This looks good, except that the data fields should be moved down into PipeMapper and
PipeReducer, respectively. They should also be made private. You can set them in the
PipeMapper and PipeReducer configure methods. Please also include a test for the change.
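
A minimal sketch of the layout suggested above, not the actual patch. It assumes the property name proposed in the issue (stream.map.input.field.separator) and uses a plain java.util.Map as a stand-in for JobConf so that it compiles without Hadoop on the classpath. The class name PipeMapperSketch and the method formatRecord are hypothetical.

```java
import java.util.Map;

// Hypothetical stand-in for PipeMapper. A plain Map replaces JobConf here
// so the sketch is self-contained.
class PipeMapperSketch {
    // Private field owned by the mapper class, as the review requests.
    private String mapInputFieldSeparator = "\t";

    // Mirrors the role of a configure method: read the separator once per task.
    // The property name is the one proposed in the issue; TAB stays the default.
    void configure(Map<String, String> job) {
        mapInputFieldSeparator =
            job.getOrDefault("stream.map.input.field.separator", "\t");
    }

    // Builds the line sent to the mapper process, replacing the hard-coded
    // clientOut_.write('\t') between key and value.
    String formatRecord(String key, String value) {
        return key + mapInputFieldSeparator + value + "\n";
    }
}
```

A PipeReducer counterpart would look the same, reading stream.reduce.input.field.separator instead.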

> make key-value separators in hadoop streaming fully configurable
> ----------------------------------------------------------------
>
>                 Key: HADOOP-3341
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3341
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: contrib/streaming
>            Reporter: Zheng Shao
>            Assignee: Zheng Shao
>         Attachments: 3341-1.patch
>
>
> By default, hadoop streaming uses TAB as the separator in all places. However, in some environments, users may want to use customized separators (e.g., ^A = \u0001).
> The separator logic in hadoop streaming is very convoluted. Here is a brief summary:
> InputFormat {
>     KeyValueLineRecordReader.java:59:
> S1: String sepStr = job.get("key.value.separator.in.input.line", "\t");
> }
> Mapper {
>     PipeMapper.java:88: 
> S2: clientOut_.write('\t');
>     Call mapper process
>     PipeMapRed.java:124:
> S3: String mapOutputFieldSeparator = job_.get("stream.map.output.field.separator", "\t");
>     PipeMapRed.java:128:
>     this.numOfMapOutputKeyFields = job_.getInt("stream.num.map.output.key.fields", 1);
> }
> Reducer {
>     PipeReducer.java:78:
> S4: clientOut_.write('\t');
>     Call reducer process
>     PipeMapRed.java:125:
> S5: String reduceOutputFieldSeparator = job_.get("stream.reduce.output.field.separator", "\t");
>     PipeMapRed.java:129:
>     this.numOfReduceOutputKeyFields = job_.getInt("stream.num.reduce.output.key.fields", 1);
> }
> OutputFormat {
>     TextOutputFormat.java:112:
> S6: String keyValueSeparator = job.get("mapred.textoutputformat.separator", "\t");
> }
> Short-cuts: 
> 1. When "TextInputFormat" is used, S1 and S2 are not used at all. Lines are fed directly into the mapper (through the value part of the key-value pair; the keys, which are offsets, are ignored).
> 2. For jobs with no reducers, the "Reducer" step is skipped.
> We need to make S2 and S4 (the two hard-coded TAB writes) configurable, possibly under the following names for conformity:
> stream.map.input.field.separator
> stream.reduce.input.field.separator
> Then, by specifying:
>     -jobconf key.value.separator.in.input.line=^A
>     -jobconf stream.map.input.field.separator=^A
>     -jobconf stream.map.output.field.separator=^A
>     -jobconf stream.reduce.input.field.separator=^A
>     -jobconf stream.reduce.output.field.separator=^A
>     -jobconf mapred.textoutputformat.separator=^A
> we will be able to use ^A instead of TAB in every place!
> Maybe hadoop streaming can also provide a single option to override these 6 options.
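
One way the single-option idea could work, sketched under assumptions: the umbrella key stream.field.separator is hypothetical (the issue does not name one), and a plain java.util.Map stands in for the job configuration. The six per-stage property names are the ones listed in the description, and an explicitly set per-stage value takes precedence over the umbrella value.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative helper, not part of Hadoop: expands one umbrella separator
// setting into the six per-stage settings from the issue description.
class SeparatorDefaults {
    // Hypothetical option name; the issue only suggests that such an option could exist.
    static final String UMBRELLA_KEY = "stream.field.separator";

    // Per-stage keys S1..S6, in pipeline order.
    static final String[] SEPARATOR_KEYS = {
        "key.value.separator.in.input.line",
        "stream.map.input.field.separator",
        "stream.map.output.field.separator",
        "stream.reduce.input.field.separator",
        "stream.reduce.output.field.separator",
        "mapred.textoutputformat.separator",
    };

    // Returns a copy of the job settings with every unset per-stage key
    // filled in from the umbrella key. Explicit per-stage values are kept.
    static Map<String, String> expand(Map<String, String> job) {
        Map<String, String> result = new LinkedHashMap<>(job);
        String sep = job.get(UMBRELLA_KEY);
        if (sep != null) {
            for (String key : SEPARATOR_KEYS) {
                result.putIfAbsent(key, sep);
            }
        }
        return result;
    }
}
```

With this shape, a job that sets only the umbrella key gets ^A at every stage, while a job that also sets, say, mapred.textoutputformat.separator keeps its own value for the final output.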

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

