hadoop-common-dev mailing list archives

From "Vivek Ratan (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-475) The value iterator to reduce function should be clonable
Date Wed, 27 Jun 2007 12:54:26 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508528 ]

Vivek Ratan commented on HADOOP-475:

I wanted to expand on my previous comment. When I said "this can probably be done just as
well in user code", I didn't necessarily imply that we let each user write his/her code to
do this. What I was implying was that either we build a set of user-level classes (i.e., perhaps
not part of the core platform conceptually, but still written by us) or we develop sample
code and maybe let users copy from it. It seems to me that every time someone wants to define
a new iterator over a set of values, they need to clone the set of values, sort the copy
using a different comparator, and then provide an iterator over it. This can be a bit tricky
if the values don't all fit in memory; we would need disk support for that. As Doug points out
in one of his comments on HADOOP-485, we could maybe use SequenceFile for that.
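
To make the clone-and-sort idea concrete, here is a minimal in-memory sketch. The class name BufferedValues and its methods are hypothetical (they are not part of Hadoop), and the sketch assumes the values fit in memory and that every call to next() returns a distinct object; if the framework reuses value objects, each value would have to be copied before it is buffered. The disk-backed case mentioned above is not covered.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;

// Hypothetical user-level helper, not an existing Hadoop class.
final class BufferedValues<V> {
    private final List<V> buffer = new ArrayList<>();

    // Drains the single-pass iterator into memory.
    // Assumption: all values fit in memory and are distinct objects.
    BufferedValues(Iterator<V> values) {
        while (values.hasNext()) {
            buffer.add(values.next());
        }
    }

    // Returns a fresh iterator over the values in their original order.
    // Each call starts from the beginning, so it also serves as a "reset".
    Iterator<V> iterator() {
        return new ArrayList<>(buffer).iterator();
    }

    // Returns a fresh iterator over a sorted copy of the values.
    // The buffered values themselves keep their original order.
    Iterator<V> sortedIterator(Comparator<? super V> order) {
        List<V> copy = new ArrayList<>(buffer);
        copy.sort(order);
        return copy.iterator();
    }
}
```

A reduce method could build one BufferedValues from its value iterator and then call iterator() or sortedIterator(...) as many times as it needs, including inside a nested loop.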

But I think we first need to figure out how to present this to the user - what classes should
they have, how will the functionality appear to them, etc. Runping, you should probably have
some good insight into this. 

> The value iterator to reduce function should be clonable
> --------------------------------------------------------
>                 Key: HADOOP-475
>                 URL: https://issues.apache.org/jira/browse/HADOOP-475
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: mapred
>            Reporter: Runping Qi
>            Assignee: Owen O'Malley
> In the current framework, when the user implements the reduce method of the Reducer class,
> the user can only iterate through the value iterator once.
> This makes it hard for the user to perform join-like operations within the reduce method.
> To address this problem, one approach is to make the input value iterator clonable. Then the
> user can iterate over the values in different ways.
> If the iterator can be reset, then the user can perform nested iterations over the data and
> carry out join-like operations.
> The user code in reduce method would be something like:
>                   iterator1 = values.clone();
>                   iterator2 = values.clone();
>                  while (iterator1.hasNext()) {
>                       val1 = iterator1.next();
>                       iterator2.reset();
>                       while (iterator2.hasNext()) {
>                            val2 = iterator2.next();
>                            // do something based on val1 and val2
>                            .......................
>                       }
>                  }
> One possible optimization is that if the values are sorted based on a secondary key,
> the reset function can take a secondary key as an argument and reset the iterator to
> the beginning position of the secondary key. It would be very helpful if there were a utility
> that returns a map of iterators, one per secondary key value, from the given iterator:
>                           TreeMap getIteratorsBasedOnSecondaryKey(iterator);
> Each entry in the returned map object is a pair of <secondary key, iterator for the
> values with the same secondary key>.
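
The utility described in the issue could be sketched roughly as follows. This is an illustration under stated assumptions, not an existing Hadoop API: the class name SecondaryKeyGrouping and the extra secondaryKey parameter (a function that extracts the secondary key from a value) are hypothetical, and the sketch buffers all values in memory rather than repositioning the framework's iterator.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

// Hypothetical utility, not an existing Hadoop class.
final class SecondaryKeyGrouping {
    private SecondaryKeyGrouping() {}

    // Consumes the single-pass iterator and returns one iterator per
    // secondary key, ordered by key. Assumption: all values fit in memory
    // and the caller supplies how to extract the secondary key.
    static <K extends Comparable<? super K>, V> TreeMap<K, Iterator<V>>
            getIteratorsBasedOnSecondaryKey(
                    Iterator<V> values,
                    Function<? super V, ? extends K> secondaryKey) {
        // First pass: bucket the values by secondary key,
        // preserving arrival order inside each bucket.
        TreeMap<K, List<V>> groups = new TreeMap<>();
        while (values.hasNext()) {
            V value = values.next();
            groups.computeIfAbsent(secondaryKey.apply(value), k -> new ArrayList<>())
                  .add(value);
        }
        // Second pass: expose each bucket as an iterator.
        TreeMap<K, Iterator<V>> iterators = new TreeMap<>();
        for (Map.Entry<K, List<V>> entry : groups.entrySet()) {
            iterators.put(entry.getKey(), entry.getValue().iterator());
        }
        return iterators;
    }
}
```

Because the buckets are built in a single pass, this version does not require the values to arrive sorted by secondary key; if they are already sorted, a streaming variant that avoids buffering everything would be possible.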

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
