hadoop-mapreduce-issues mailing list archives

From "Lijie Xu (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (MAPREDUCE-5461) Let users be able to get latest Key in reduce()
Date Thu, 15 Aug 2013 15:37:47 GMT

     [ https://issues.apache.org/jira/browse/MAPREDUCE-5461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lijie Xu updated MAPREDUCE-5461:
--------------------------------

    Description: 
The reducer passes <K, List(V)> to reduce(). In some cases, such as SecondarySort, the
current V and the next V are grouped under the same K even though the keys they were
actually emitted with are different. For example, in SecondarySort, map() outputs:
Key         Value
<1, 3>        3
<1, 1>        1
<2, 5>        5
<1, 8>        8

After partitioning by Key.getFirst(), then sorting and grouping by Key.getFirst(),
the reducer gets:
Key         Value
------Group 1------
<1, 1>        1
<1, 3>        3
<1, 8>        8
------Group 2------
<2, 5>        5

reduce() receives:

Key      List<Value>
<1, 1>   List<1, 3, 8>
<2, 5>   List<5>

When invoking V.next(), we get the next V (e.g., 3), but there is no API to get its
corresponding Key (e.g., <1, 3>). We can only get the first Key (e.g., <1, 1>).

If users could get the latest Key, SecondarySort would not need to emit the value in map(),
which would reduce network traffic.
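
To make the request concrete, here is a minimal plain-Java sketch of the grouping described above. It uses no Hadoop classes; LatestKeySketch, PairKey and describeGroups are hypothetical names invented for this illustration and are not part of any existing API. It only simulates what a "latest key" accessor would expose next to the single key that reduce() is handed per group.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Plain-Java simulation of the SecondarySort example; not Hadoop code.
class LatestKeySketch {

    /** Hypothetical composite key <first, second>. */
    static final class PairKey {
        final int first;
        final int second;

        PairKey(int first, int second) {
            this.first = first;
            this.second = second;
        }

        @Override
        public String toString() {
            return "<" + first + ", " + second + ">";
        }
    }

    /**
     * Sorts the keys by (first, second), groups them by first, and returns
     * one line per group: the key reduce() is handed (the first key of the
     * group) followed by the "latest" key that belongs to each value.
     */
    static List<String> describeGroups(List<PairKey> mapOutputKeys) {
        List<PairKey> sorted = new ArrayList<>(mapOutputKeys);
        sorted.sort(Comparator.comparingInt((PairKey k) -> k.first)
                .thenComparingInt(k -> k.second));

        List<String> lines = new ArrayList<>();
        int i = 0;
        while (i < sorted.size()) {
            PairKey groupKey = sorted.get(i); // what reduce() sees today
            StringBuilder latest = new StringBuilder();
            int j = i;
            // Walk the group; each step stands for one V.next() call.
            while (j < sorted.size() && sorted.get(j).first == groupKey.first) {
                if (j > i) {
                    latest.append(" ");
                }
                latest.append(sorted.get(j)); // key a "latest key" API would return
                j++;
            }
            lines.add("reduce key " + groupKey + " -> latest keys " + latest);
            i = j;
        }
        return lines;
    }

    public static void main(String[] args) {
        List<PairKey> keys = Arrays.asList(
                new PairKey(1, 3), new PairKey(1, 1),
                new PairKey(2, 5), new PairKey(1, 8));
        for (String line : describeGroups(keys)) {
            System.out.println(line);
        }
    }
}
```

In this sketch the second field of each latest key equals the value from the example (1, 3, 8), which is why the value would not need to be shipped separately if such an accessor existed.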

Another example is Join. If we could get the latest Key, we would not need to put the table
label in both the key and the value.



  was:
Reducer generates <K, List(V)> for reduce(). In some cases such as SecondarySort, although
current V and next V share the same K, their actual corresponding Ks are different. For example,
in SecondarySort, map() outputs
Key         Value
<1, 3>        3
<1, 1>        1
<2, 5>        5
<1, 8>        8

After partition by Key.getFirst(), sort and group by Key.getFirst(),
reducer gets:
Key         Value
------Group 1------
<1, 1>        1
<1, 3>        3
<1, 8>        8
------Group 2------
<2, 5>        5

reduce() receives:

Key      List<Value>
<1, 1>   List<1, 3, 8>
<2, 5>   List<5>

When invoking V.next(), we can get next V (e.g, 3). But we do not have API to get its corresponding
Key (e.g, <1, 3>). We can only get the first Key (e.g., <1,1>).

If we let user be able to get latest key, SecondarySort does not need to emit value in map().
So that the network traffic is better.

Another example is Join. If we can get latest Key, we do need to put table label in both key
and value.



    
> Let users be able to get latest Key in reduce()
> -----------------------------------------------
>
>                 Key: MAPREDUCE-5461
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5461
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: task
>    Affects Versions: 1.2.1
>         Environment: Any environment
>            Reporter: Lijie Xu
>              Labels: features
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
