hadoop-common-dev mailing list archives

From "Runping Qi (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-475) The value iterator to reduce function should be clonable
Date Wed, 27 Jun 2007 14:52:26 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508566 ]

Runping Qi commented on HADOOP-475:

The data_join package in contrib provides a user-level implementation of this feature.
It is an in-memory solution for now, but it does allow disk-backed iterators to be plugged in.
It may not be as efficient as a framework-backed solution, but it is simple, works well
in most common cases, and does not introduce complexity into the framework.

In my opinion, it may not be worth pursuing this Jira further. Rather, the low-hanging
fruit is to simply provide a disk-backed iterator plugin for the data_join package.
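
For illustration, the in-memory approach can be sketched in plain Java: copy the
single-pass value iterator into a list once, then run the nested loops over the list.
This is only a sketch under stated assumptions, not the data_join code. The names
BufferedNestedIteration, buffer and allPairs are hypothetical, and the sketch assumes
the values fit in memory and that each value object can be kept by reference (if the
framework reuses value objects, each one would have to be copied before buffering).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical helper, not part of Hadoop or the data_join package.
class BufferedNestedIteration {

    // Drains a single-pass iterator into a list so it can be traversed repeatedly.
    static <T> List<T> buffer(Iterator<T> values) {
        List<T> buffered = new ArrayList<>();
        while (values.hasNext()) {
            buffered.add(values.next());
        }
        return buffered;
    }

    // Builds every (val1, val2) combination, the nested-loop shape a join needs.
    static <T> List<String> allPairs(Iterator<T> values) {
        List<T> buffered = buffer(values);
        List<String> pairs = new ArrayList<>();
        for (T val1 : buffered) {
            for (T val2 : buffered) {
                pairs.add(val1 + "," + val2);
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        Iterator<String> values = Arrays.asList("a", "b").iterator();
        System.out.println(allPairs(values));
    }
}
```

A disk-backed variant would keep the same nested-loop shape and replace the list
with a structure that spills to local storage.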

> The value iterator to reduce function should be clonable
> --------------------------------------------------------
>                 Key: HADOOP-475
>                 URL: https://issues.apache.org/jira/browse/HADOOP-475
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: mapred
>            Reporter: Runping Qi
>            Assignee: Owen O'Malley
> In the current framework, when the user implements the reduce method of the Reducer class,
> the user can only iterate through the value iterator once.
> This makes it hard for the user to perform join-like operations within the reduce method.
> To address this problem, one approach is to make the input value iterator clonable. Then the
> user can iterate over the values in different ways.
> If the iterator can be reset, then the user can perform nested iterations over the data
> and carry out join-like operations.
> The user code in the reduce method would be something like:
>                  iterator1 = values.clone();
>                  iterator2 = values.clone();
>                  while (iterator1.hasNext()) {
>                       val1 = iterator1.next();
>                       iterator2.reset();
>                       while (iterator2.hasNext()) {
>                            val2 = iterator2.next();
>                            // do something based on val1 and val2
>                            // ...
>                       }
>                  }
> One possible optimization is that if the values are sorted based on a secondary key,
> the reset function can take a secondary key as an argument and reset the iterator to
> the beginning position of that secondary key. It would be very helpful to have a utility
> that returns a list of iterators, one per secondary key value, from the given iterator:
>                           TreeMap getIteratorsBasedOnSecondaryKey(iterator);
> Each entry in the returned map object is a pair of <secondary key, iterator for the
> values with the same secondary key>.
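
A user-level approximation of the secondary-key utility described in the quoted
issue can be sketched in plain Java. This is an assumption-laden illustration, not
the proposed framework API: the name groupBySecondaryKey and the secondaryKeyOf
extractor argument are hypothetical, the values are buffered in memory, and the map
holds lists rather than iterators so that each group can be traversed more than once
(calling iterator() on a list yields a fresh iterator each time).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.TreeMap;
import java.util.function.Function;

// Hypothetical helper, not part of Hadoop.
class SecondaryKeyGrouping {

    // Groups values by a caller-supplied secondary key; the TreeMap keeps
    // the groups ordered by that key.
    static <K extends Comparable<K>, V> TreeMap<K, List<V>> groupBySecondaryKey(
            Iterator<V> values, Function<V, K> secondaryKeyOf) {
        TreeMap<K, List<V>> groups = new TreeMap<>();
        while (values.hasNext()) {
            V value = values.next();
            groups.computeIfAbsent(secondaryKeyOf.apply(value), k -> new ArrayList<>())
                  .add(value);
        }
        return groups;
    }

    public static void main(String[] args) {
        // In this toy input the secondary key is the text before the colon.
        Iterator<String> values = Arrays.asList("b:1", "a:2", "b:3").iterator();
        TreeMap<String, List<String>> groups =
                groupBySecondaryKey(values, v -> v.substring(0, v.indexOf(':')));
        System.out.println(groups); // prints {a=[a:2], b=[b:1, b:3]}
    }
}
```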

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
