hadoop-common-dev mailing list archives

From "Runping Qi (JIRA)" <j...@apache.org>
Subject [jira] Created: (HADOOP-1284) clean up the protocol between stream mapper/reducer and the framework
Date Sat, 21 Apr 2007 00:33:15 GMT
clean up the protocol between stream mapper/reducer and the framework
---------------------------------------------------------------------

                 Key: HADOOP-1284
                 URL: https://issues.apache.org/jira/browse/HADOOP-1284
             Project: Hadoop
          Issue Type: Improvement
            Reporter: Runping Qi



Right now, the protocol between the stream mapper/reducer and the framework is very inflexible.
The mapper/reducer generates line-oriented output. The framework reads it line by line and
splits each line into a key/value pair. By default, the substring up to the first tab character
is the key, and the substring after the first tab character is the value.
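
To make the current rule concrete, here is a minimal Python sketch of the first-tab split. It is only an illustration, not the framework's actual code: the function name `split_default` is hypothetical, and the handling of a line that contains no tab (whole line as key, empty value) is an assumption.

```python
def split_default(line):
    """Split one mapper/reducer output line at the first tab into (key, value)."""
    line = line.rstrip("\n")
    # partition() splits at the first tab only; later tabs stay in the value.
    key, _tab, value = line.partition("\t")
    # Assumption: a line without any tab yields the whole line as key
    # and an empty string as value.
    return key, value
```

For example, `split_default("k\tv1\tv2")` gives the key `k` and leaves the second tab inside the value.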

However, in many cases, the application wants some control over how the pair is split. 
Here, I'd like to introduce the following configuration variables for that:

1. "streaming.output.field.separator": the field separator. The default is the tab character,
but the user can specify a different one (e.g. '|' or ' ').
A map output line can then be treated as a list of fields separated by the separator.

2. "streaming.num.fields.for.mapout.key": the number of leading fields that form the
map output key (and are used for sorting on the reduce side).
The default value is 1.
The remaining fields form the value. For example, I can specify the first 5
fields as my mapout key.

3. "streaming.num.fields.for.partitioning": Sometimes I want to partition on fewer fields
than the key contains, to achieve the "primary/secondary" composite key effect proposed in
HADOOP-485. The default value is 1. For example, I can set "streaming.num.fields.for.partitioning"
to 3 and "streaming.num.fields.for.mapout.key" to 5. This effectively says that fields 1 to 3
are my primary key and fields 4 and 5 are my secondary key.
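
The three settings together could behave roughly as in the Python sketch below. This is one possible reading of the proposal, not an implementation from the framework: the function `split_record` and its parameter names are hypothetical stand-ins for the three configuration variables, and rejoining fields with the same separator is an assumption.

```python
def split_record(line, separator="\t", num_key_fields=1, num_partition_fields=1):
    """Split one output line into (key, value, partition_key).

    separator            stands in for streaming.output.field.separator
    num_key_fields       stands in for streaming.num.fields.for.mapout.key
    num_partition_fields stands in for streaming.num.fields.for.partitioning
    """
    fields = line.rstrip("\n").split(separator)
    # The leading fields form the sort key; the remaining fields form the value.
    key = separator.join(fields[:num_key_fields])
    value = separator.join(fields[num_key_fields:])
    # Partitioning looks only at a prefix of the key, so records that share
    # the primary fields reach the same reducer, sorted by the full key.
    partition_key = separator.join(fields[:num_partition_fields])
    return key, value, partition_key


# Fields 1-3 act as the primary key, fields 4-5 as the secondary key.
key, value, partition_key = split_record(
    "a|b|c|d|e|f|g", separator="|", num_key_fields=5, num_partition_fields=3
)
```

With these arguments the key is `a|b|c|d|e`, the value is `f|g`, and the partition key is `a|b|c`. Called with the defaults, the function reduces to the first-tab split.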

With the default values above, the proposal is compatible with the current behavior while
introducing a desirable new feature in a clean way.

Thoughts?




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

