hadoop-common-dev mailing list archives

From "Runping Qi (JIRA)" <j...@apache.org>
Subject [jira] Updated: (HADOOP-1284) clean up the protocol between stream mapper/reducer and the framework
Date Wed, 25 Apr 2007 18:13:15 GMT

     [ https://issues.apache.org/jira/browse/HADOOP-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

Runping Qi updated HADOOP-1284:

    Status: Patch Available  (was: Open)

> clean up the protocol between stream mapper/reducer and the framework
> ---------------------------------------------------------------------
>                 Key: HADOOP-1284
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1284
>             Project: Hadoop
>          Issue Type: Improvement
>            Reporter: Runping Qi
>         Assigned To: Runping Qi
>         Attachments: patch-1284.txt
> Right now, the protocol between the stream mapper/reducer and the framework is very inflexible.
> The mapper/reducer generates line-oriented output. The framework reads it line by line and splits
> each line into a key/value pair. By default, the substring up to the first tab character is the key,
> and the substring after the first tab character is the value.
> However, in many cases the application wants some control over how the pair is split.
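
For reference, the default split described above (key up to the first tab, value after it) could be sketched as follows. This is only an illustration, not the streaming framework's code: the function name split_default is hypothetical, and the handling of a line with no tab is an assumption, since the description does not cover that case.

```python
def split_default(line):
    """Split one line of mapper/reducer output into (key, value)."""
    # Drop the trailing newline, then cut at the FIRST tab only.
    # Assumption: a line with no tab becomes the key, with an empty value.
    key, _, value = line.rstrip("\n").partition("\t")
    return key, value


if __name__ == "__main__":
    print(split_default("user42\tclick\t2007-04-25\n"))
```

Any further tabs stay inside the value, which is why the application currently has no say in where the key ends.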

> Here, I'd like to introduce the following configuration variables for that:
> 1. "streaming.output.field.separator": the separator between fields. The default value is the tab character,
> but the user can specify a different one (e.g. ':', or ', ', etc.).
> A map output line can be considered as a list of fields separated by the separator.
> 2. "streaming.num.fields.for.mapout.key": the number of leading fields that will be used as the map output key
> (and for sorting on the reduce side).
> The default value is 1.
> The rest of the fields will be used as the value. For example, I can specify the first 5 fields as my mapout key.
> 3. "streaming.num.fields.for.partitioning": Sometimes I want to use fewer fields for partitioning, to
> achieve the "primary/secondary" composite key effect proposed in HADOOP-485. The default value is 1.
> For example, I can set "streaming.num.fields.for.partitioning" to 3
> and "streaming.num.fields.for.mapout.key" to 5.
> This effectively amounts to saying that fields 4 and 5 are my secondary key.
> With the above default values, it is compatible with the current behavior 
> while introducing a new desirable feature in a clean way.
> Thoughts?
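
To make the proposal concrete, here is a minimal sketch of how the three settings could combine on one output line. It is not taken from patch-1284.txt: the function name split_fields and its parameter names are hypothetical stand-ins for "streaming.output.field.separator", "streaming.num.fields.for.mapout.key" and "streaming.num.fields.for.partitioning", and rejoining the fields with the same separator is an assumption.

```python
def split_fields(line, separator="\t", num_key_fields=1, num_partition_fields=1):
    """Return (partition_key, key, value) for one line of map output."""
    # Treat the line as a list of fields divided by the configured separator.
    fields = line.rstrip("\n").split(separator)
    # The first num_key_fields fields form the map output key (used for sorting).
    key = separator.join(fields[:num_key_fields])
    # Everything after the key fields is the value.
    value = separator.join(fields[num_key_fields:])
    # A (usually shorter) prefix of the fields decides the partition.
    partition_key = separator.join(fields[:num_partition_fields])
    return partition_key, key, value


if __name__ == "__main__":
    # The example from the description: 3 partitioning fields, 5 key fields.
    print(split_fields("a:b:c:d:e:f:g", separator=":",
                       num_key_fields=5, num_partition_fields=3))
```

With the defaults (tab separator, one key field, one partitioning field) the sketch reduces to the current first-tab split, which matches the compatibility claim in the description.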

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
