hadoop-common-dev mailing list archives

From "Runping Qi" <runp...@yahoo-inc.com>
Subject RE: [jira] Updated: (HADOOP-1284) clean up the protocol between stream mapper/reducer and the framework
Date Wed, 25 Apr 2007 19:30:49 GMT
Arkady,

The FieldSelectionMapReduce and KeyFieldBasedPartitioner classes let you
do exactly what you want (namely, select fields 6, 3, 8 and 5 as your
sorting keys).

Runping
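
As a rough illustration of the idea in the reply above, the following Python sketch builds a sort key from an arbitrary combination of fields (6, 3, 8 and 5) while leaving the record itself untouched. It is not the FieldSelectionMapReduce API: the function name `select_key_fields` and the sample record are hypothetical and only show the field-selection idea.

```python
def select_key_fields(line, key_fields, sep="\t"):
    """Build a sort key from the given 1-based field positions.

    Hypothetical helper for illustration only; it is not part of Hadoop.
    Returns (key, value), where value is the unmodified input line.
    """
    fields = line.split(sep)
    key = sep.join(fields[i - 1] for i in key_fields)
    return key, line


# Sample record with nine tab-separated fields: f1 ... f9 (made-up data).
record = "\t".join(f"f{i}" for i in range(1, 10))

# Use fields 6, 3, 8 and 5, in that order, as the sorting key.
key, value = select_key_fields(record, [6, 3, 8, 5])
print(repr(key))
```

Because the key is built separately from the record, the key fields do not have to sit at the beginning of the line.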


> -----Original Message-----
> From: Arkady Borkovsky [mailto:arkady@yahoo-inc.com]
> Sent: Wednesday, April 25, 2007 12:17 PM
> To: hadoop-dev@lucene.apache.org
> Subject: Re: [jira] Updated: (HADOOP-1284) clean up the protocol between
> stream mapper/reducer and the framework
> 
> Runping,
> 
> as we discussed yesterday, it may be better to implement more complete
> functionality that would allow any combination of fields to be
> specified for partitioning and for sorting.
> This can easily be implemented on top of the functionality this specific
> patch provides (by prepending the actual keys in the "streaming
> mapper" class, and stripping them in the "streaming reducer" class before
> feeding records to the streaming reducer command provided by the user).
> 
> However, at the user level, I'd suggest you expose the "complete"
> functionality, rather than limiting it by requiring the keys to be at
> the beginning of the record.
> 
> On Apr 25, 2007, at 11:13 AM, Runping Qi (JIRA) wrote:
> 
> >
> >      [ https://issues.apache.org/jira/browse/HADOOP-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
> >
> > Runping Qi updated HADOOP-1284:
> > -------------------------------
> >
> >     Description:
> > Right now, the protocol between stream mapper/reducer and the
> > framework is very inflexible.
> > The mapper/reducer generates line-oriented output. The framework picks
> > up the output line by line and splits
> > each line into a key/value pair. By default, the substring up to the
> > first tab char is the key, and the
> > substring after the first tab char is the value.
> >
> > However, in many cases, the application wants some control over how
> > the pair is split.
> > Here, I'd like to introduce the following configuration variables for
> > that:
> >
> > 1. "streaming.output.field.separator": the value will be the tab char
> > by default,
> > but the user can specify a different one (e.g. ':', or ', ', etc.).
> > A map output line can be considered as a list of fields separated by
> > the separator.
> >
> > 2. "streaming.num.fields.for.mapout.key": the number of leading
> > fields that will be used as the map output key
> > (and for sorting on the reduce side).
> > The default value is 1.
> > The rest of the fields will be used as the value. For example, I can
> > specify the first 5 fields as my mapout key.
> >
> > 3. "streaming.num.fields.for.partitioning": Sometimes, I want to use
> > fewer fields for partitioning to
> > achieve the "primary/secondary" composite
> > key effect proposed in HADOOP-485. The default value is 1.
> > For example, I can set "streaming.num.fields.for.partitioning" to 3
> > and "streaming.num.fields.for.mapout.key" to 5.
> > This effectively amounts to saying that fields 4 and 5 are my
> > secondary key.
> >
> > With the above default values, it is compatible with the current
> > behavior
> > while introducing a new desirable feature in a clean way.
> >
> > Thoughts?
> >
> >
> >
> >
> >   was:
> >
> > Right now, the protocol between stream mapper/reducer and the
> > framework is very inflexible.
> > The mapper/reducer generates line-oriented output. The framework picks
> > up the output line by line and splits
> > each line into a key/value pair. By default, the substring up to the
> > first tab char is the key, and the
> > substring after the first tab char is the value.
> >
> > However, in many cases, the application wants some control over how
> > the pair is split.
> > Here, I'd like to introduce the following configuration variables for
> > that:
> >
> > 1. "streaming.output.field.separator": the value will be the tab char
> > by default, but the user can specify a different one (e.g. '|', or
> > ' ', etc.).
> > A map output line can be considered as a list of fields separated by
> > the separator.
> >
> > 2. "streaming.num.fields.for.mapout.key": the number of leading
> > fields that will be used as the map output key (and for sorting on the
> > reduce side).
> > The default value is 1.
> > The rest of the fields will be used as the value. For example, I can
> > specify the first 5 fields as my mapout key.
> >
> > 3. "streaming.num.fields.for.partitioning": Sometimes, I want to use
> > fewer fields for partitioning to achieve the "primary/secondary" composite
> > key effect proposed in HADOOP-485. The default value is 1. For
> > example, I can set "streaming.num.fields.for.partitioning" to 3
> > and "streaming.num.fields.for.mapout.key" to 5. This effectively
> > amounts to saying that fields 4 and 5 are my secondary key.
> >
> > With the above default values, it is compatible with the current
> > behavior while introducing a new desirable feature in a clean way.
> >
> > Thoughts?
> >
> >
> >
> >
> >
> > This patch implements the proposed protocol.
> >
> > With this patch, the streaming user can specify a field separator for
> > the mapper's output and/or a field separator
> > for the reducer's output. The default is the tab char.
> >
> > The user can also specify how many fields in the output constitute the
> > key. The default is 1.
> > The rest of the line is the value.
> >
> > A partitioner class, KeyFieldBasedPartitioner in mapred.lib, is also
> > implemented.
> > The user can specify how many fields of the map output key
> > are used for partitioning.
> >
> > Also, a utility class, FieldSelectionMapReduce in mapred.lib, is added.
> > This class allows the
> > user to create map/reduce jobs that manipulate text data like the Unix
> > cut utility.
> > The user can specify the field separator (the delimiter for cut),
> > which fields to select, and
> > by which fields to partition/sort.
> >
> > Two unit tests are introduced.
> > All the unit tests passed.
> >
> > (Comment above posted by Runping Qi, 25/Apr/07 11:07 AM.)
> >
> >
> >> clean up the protocol between stream mapper/reducer and the framework
> >> ---------------------------------------------------------------------
> >>
> >>                 Key: HADOOP-1284
> >>                 URL: https://issues.apache.org/jira/browse/HADOOP-1284
> >>             Project: Hadoop
> >>          Issue Type: Improvement
> >>            Reporter: Runping Qi
> >>         Assigned To: Runping Qi
> >>         Attachments: patch-1284.txt
> >>
> >>
> >> Right now, the protocol between stream mapper/reducer and the
> >> framework is very inflexible.
> >> The mapper/reducer generates line-oriented output. The framework
> >> picks up the output line by line and splits
> >> each line into a key/value pair. By default, the substring up to the
> >> first tab char is the key, and the
> >> substring after the first tab char is the value.
> >> However, in many cases, the application wants some control over how
> >> the pair is split.
> >> Here, I'd like to introduce the following configuration variables for
> >> that:
> >> 1. "streaming.output.field.separator": the value will be the tab char
> >> by default,
> >> but the user can specify a different one (e.g. ':', or ', ', etc.).
> >> A map output line can be considered as a list of fields separated by
> >> the separator.
> >> 2. "streaming.num.fields.for.mapout.key": the number of leading
> >> fields that will be used as the map output key
> >> (and for sorting on the reduce side).
> >> The default value is 1.
> >> The rest of the fields will be used as the value. For example, I can
> >> specify the first 5 fields as my mapout key.
> >> 3. "streaming.num.fields.for.partitioning": Sometimes, I want to use
> >> fewer fields for partitioning to
> >> achieve the "primary/secondary" composite
> >> key effect proposed in HADOOP-485. The default value is 1.
> >> For example, I can set "streaming.num.fields.for.partitioning" to 3
> >> and "streaming.num.fields.for.mapout.key" to 5.
> >> This effectively amounts to saying that fields 4 and 5 are my
> >> secondary key.
> >> With the above default values, it is compatible with the current
> >> behavior
> >> while introducing a new desirable feature in a clean way.
> >> Thoughts?
> >
> > --
> > This message is automatically generated by JIRA.
> > -
> > You can reply to this email to add a comment to the issue online.
> >
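
The protocol described in the quoted JIRA text can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the functions `split_key_value` and `partition` are hypothetical, the CRC32 hash is an arbitrary stand-in, and none of it is the patch's code or the KeyFieldBasedPartitioner implementation. The parameter names loosely mirror the three proposed configuration variables.

```python
import zlib


def split_key_value(line, num_key_fields=1, sep="\t"):
    """Split one output line into (key, value).

    The first num_key_fields fields form the key; the remaining fields
    form the value. With the defaults (tab, 1) this matches the behavior
    described as current: key = text before the first tab.
    """
    fields = line.split(sep)
    key = sep.join(fields[:num_key_fields])
    value = sep.join(fields[num_key_fields:])
    return key, value


def partition(key, num_partition_fields, num_reducers, sep="\t"):
    """Pick a reducer from only the first num_partition_fields key fields.

    CRC32 is an assumption made for this sketch, chosen because it is
    deterministic across runs.
    """
    prefix = sep.join(key.split(sep)[:num_partition_fields])
    return zlib.crc32(prefix.encode("utf-8")) % num_reducers


# Example from the description: 5 key fields, 3 of them used for partitioning,
# so fields 4 and 5 act as a secondary key.
key_a, value_a = split_key_value("a:b:c:d:e:f:g", num_key_fields=5, sep=":")
key_b, _ = split_key_value("a:b:c:x:y:f:g", num_key_fields=5, sep=":")
same_reducer = partition(key_a, 3, 10, ":") == partition(key_b, 3, 10, ":")
print(key_a, value_a, same_reducer)
```

Records that share the first three fields go to the same reducer, while the full five-field key would still decide their sort order there.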

