hadoop-common-dev mailing list archives

From Arkady Borkovsky <ark...@yahoo-inc.com>
Subject Re: [jira] Updated: (HADOOP-1284) clean up the protocol between stream mapper/reducer and the framework
Date Wed, 25 Apr 2007 20:15:38 GMT
Wonderful!

On Apr 25, 2007, at 12:30 PM, Runping Qi wrote:

> Arkady,
>
> The FieldSelectionMapReduce and KeyFieldBasedPartitioner classes let you
> do exactly what you want (namely, select fields 6, 3, 8 and 5 as your
> sorting keys).
>
> Runping
>
>
>> -----Original Message-----
>> From: Arkady Borkovsky [mailto:arkady@yahoo-inc.com]
>> Sent: Wednesday, April 25, 2007 12:17 PM
>> To: hadoop-dev@lucene.apache.org
>> Subject: Re: [jira] Updated: (HADOOP-1284) clean up the protocol  
>> between
>> stream mapper/reducer and the framework
>>
>> Runping,
>>
>> as we discussed yesterday, it may be better to implement more complete
>> functionality that would allow any combination of fields to be
>> specified for partitioning and for sorting.
>> This can easily be implemented on top of the functionality this specific
>> patch provides (by prepending the actual keys in the "streaming
>> mapper" class, and stripping them in the "streaming reducer" class before
>> feeding the records to the streaming reducer command provided by the user).
>>
>> However, at the user level, I'd suggest you export the "complete"
>> functionality, rather than limiting it by requiring the keys to be in
>> the beginning of the record.
>>
>> On Apr 25, 2007, at 11:13 AM, Runping Qi (JIRA) wrote:
>>
>>>
>>>      [ https://issues.apache.org/jira/browse/HADOOP-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
>>>
>>> Runping Qi updated HADOOP-1284:
>>> -------------------------------
>>>
>>>     Description:
>>> Right now, the protocol between stream mapper/reducer and the
>>> framework is very inflexible.
>>> The mapper/reducer generates line-oriented output. The framework picks
>>> up the output line by line and splits
>>> each line into a key/value pair. By default, the substring up to the
>>> first tab char is the key, and the
>>> substring after the first tab char is the value.
>>>
>>> However, in many cases, the application wants some control over how
>>> the pair is split.
>>> Here, I'd like to introduce the following configuration variables for
>>> that:
>>>
>>> 1. "streaming.output.field.separator": the separator is the tab
>>> character by default,
>>> but the user can specify a different one (e.g. ':', or ', ', etc.)
>>> A map output line can be considered as a list of fields separated by
>>> the separator.
>>>
>>> 2. "streaming.num.fields.for.mapout.key": the number of leading
>>> fields to be used as the map output key
>>> (and for sorting on the reduce side).
>>> The default value is 1.
>>> The rest of the fields will be used as the value.  For example, I can
>>> specify the first 5 fields as my mapout key.
>>>
>>> 3. "streaming.num.fields.for.partitioning": Sometimes, I want to use
>>> fewer fields for partitioning to
>>> achieve the "primary/secondary" composite
>>> key effect proposed in HADOOP-485. The default value is 1.
>>> For example, I can set "streaming.num.fields.for.partitioning" to 3
>>> and "streaming.num.fields.for.mapout.key" to 5.
>>> This effectively amounts to saying that fields 4 and 5 are my
>>> secondary key.
>>>
>>> With the above default values, it is compatible with the current
>>> behavior
>>> while introducing a new desirable feature in a clean way.
>>>
>>> Thoughts?
>>>
>>>
>>>
>>>
>>>   was:
>>>
>>> Right now, the protocol between stream mapper/reducer and the
>>> framework is very inflexible.
>>> The mapper/reducer generates line-oriented output. The framework picks
>>> up the output line by line and splits
>>> each line into a key/value pair. By default, the substring up to the
>>> first tab char is the key, and the
>>> substring after the first tab char is the value.
>>>
>>> However, in many cases, the application wants some control over how
>>> the pair is split.
>>> Here, I'd like to introduce the following configuration variables for
>>> that:
>>>
>>> 1. "streaming.output.field.separator": the separator is the tab
>>> character by default, but the user can specify a different one
>>> (e.g. '|', or ' ', etc.)
>>> A map output line can be considered as a list of fields separated by
>>> the separator.
>>>
>>> 2. "streaming.num.fields.for.mapout.key": the number of leading
>>> fields to be used as the map output key (and for sorting on the reduce
>>> side).
>>> The default value is 1.
>>> The rest of the fields will be used as the value.  For example, I can
>>> specify the first 5 fields as my mapout key.
>>>
>>> 3. "streaming.num.fields.for.partitioning": Sometimes, I want to use
>>> fewer fields for partitioning to achieve the "primary/secondary" composite
>>> key effect proposed in HADOOP-485. The default value is 1. For
>>> example, I can set "streaming.num.fields.for.partitioning" to 3
>>> and "streaming.num.fields.for.mapout.key" to 5. This effectively
>>> amounts to saying that fields 4 and 5 are my secondary key.
>>>
>>> With the above default values, it is compatible with the current
>>> behavior while introducing a new desirable feature in a clean way.
>>>
>>> Thoughts?
>>>
>>>
>>>
>>>
>>>
>>> This patch implemented the proposed protocol.
>>>
>>> With this patch, the streaming user can specify a field separator for
>>> the mapper's output and/or a field separator
>>> for the reducer's output. The default is the tab char.
>>>
>>> The user can also specify how many fields in the output constitute the
>>> key. The default is 1.
>>> The rest of the line is the value.
>>>
>>> A partitioner class, KeyFieldBasedPartitioner in mapred.lib, is also
>>> implemented.
>>> The user can specify how many of the fields in the map output key
>>> are used for partitioning.
>>>
>>> Also a utility class, FieldSelectionMapReduce in mapred.lib, is added.
>>> This class allows the
>>> user to create map/reduce jobs that manipulate text data like the Unix
>>> cut utility.
>>> The user can specify the field separator (the delimiter for cut) and
>>> which fields to select, and
>>> by which fields to partition/sort.
>>>
>>> Two unit tests are introduced.
>>> All the unit tests passed.
>>>
>>>
>>>
>>>> clean up the protocol between stream mapper/reducer and the framework
>>>> ---------------------------------------------------------------------
>>>>
>>>>                 Key: HADOOP-1284
>>>>                 URL: https://issues.apache.org/jira/browse/HADOOP-1284
>>>>             Project: Hadoop
>>>>          Issue Type: Improvement
>>>>            Reporter: Runping Qi
>>>>         Assigned To: Runping Qi
>>>>         Attachments: patch-1284.txt
>>>>
>>>>
>>>
>>> --
>>> This message is automatically generated by JIRA.
>>> -
>>> You can reply to this email to add a comment to the issue online.
>>>
>
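
The key/value split and partitioning rule proposed in the quoted HADOOP-1284 description can be sketched in Python roughly as follows. This is only an illustration, not the code in patch-1284.txt: the function names are hypothetical, the handling of lines with too few separators is an assumption, and CRC32 is a stand-in for whatever hash the real partitioner uses.

```python
import zlib


def split_key_value(line, separator="\t", num_key_fields=1):
    """Split a line into (key, value): the first num_key_fields fields form the key."""
    if num_key_fields < 1:
        raise ValueError("num_key_fields must be at least 1")
    start = 0
    pos = -1
    for _ in range(num_key_fields):
        pos = line.find(separator, start)
        if pos == -1:
            # Assumption: with too few separators the whole line is the key.
            return line, ""
        start = pos + len(separator)
    return line[:pos], line[start:]


def choose_partition(key, num_partition_fields, num_reducers, separator="\t"):
    """Pick a reducer from the first num_partition_fields fields of the key."""
    prefix = separator.join(key.split(separator)[:num_partition_fields])
    # Stand-in hash (assumption); any deterministic hash illustrates the idea.
    return zlib.crc32(prefix.encode("utf-8")) % num_reducers


# Five key fields, three of them used for partitioning, so fields 4 and 5
# act as the secondary key.
key_a, value_a = split_key_value("a:b:c:d:e:rest", ":", 5)
key_b, value_b = split_key_value("a:b:c:x:y:other", ":", 5)
same_reducer = choose_partition(key_a, 3, 10, ":") == choose_partition(key_b, 3, 10, ":")
```

With the defaults (tab separator, one key field, one partitioning field) the sketch reduces to the "substring up to the first tab" behaviour the description calls the current one.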

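The prepend-and-strip idea from Arkady's quoted message (copy arbitrary fields to the front of each record in the mapper wrapper, remove them again before the user's reducer command sees the record) might look something like the sketch below. It is a hedged illustration of that suggestion only, not how FieldSelectionMapReduce is implemented; both function names are hypothetical and field numbers are assumed to be 1-based.

```python
def prepend_keys(record, key_fields, separator="\t"):
    """Mapper side: copy the chosen 1-based fields to the front of the record."""
    fields = record.split(separator)
    chosen = [fields[i - 1] for i in key_fields]
    return separator.join(chosen + fields)


def strip_keys(line, num_prepended, separator="\t"):
    """Reducer side: drop the prepended fields, restoring the original record."""
    return line.split(separator, num_prepended)[-1]


# Fields 6, 3, 8 and 5 as the sorting keys, as in the example from the thread.
record = "f1\tf2\tf3\tf4\tf5\tf6\tf7\tf8"
tagged = prepend_keys(record, [6, 3, 8, 5])
restored = strip_keys(tagged, 4)
```

Once the chosen fields lead the record, the "first N fields" settings from the proposal would be enough to sort and partition on them, which is why the keys no longer need to start the user's own record.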
