hadoop-common-dev mailing list archives

From "Runping Qi (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-485) allow a different comparator for grouping keys in calls to reduce
Date Thu, 19 Apr 2007 14:12:15 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12490071 ]

Runping Qi commented on HADOOP-485:
-----------------------------------

A real example is "session analysis" on log data.

Let's say we have a data set containing a collection of records, each with a timestamp field
and other fields.
We want to group the records by certain fields (the primary key fields), but we want the records
within each group to be sorted by timestamp so that the user's reduce function can analyse
the records in one pass. Otherwise, the user's code has to cache all the records within each
group and sort them by timestamp first.
(This is effectively the common approach people take, but it will run out of memory if the number
of records per group is very large.)
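
To illustrate the one-pass case, here is a minimal sketch in plain Java (no Hadoop classes). It assumes the values for one group already arrive in ascending timestamp order, which is what the proposed feature would provide. The class name SessionReduce, the method countSessions, and the 30-minute gap threshold are all hypothetical and chosen only for this example.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class SessionReduce {
    // Hypothetical rule: a gap longer than 30 minutes starts a new session.
    static final long GAP_MS = 30 * 60 * 1000L;

    /**
     * Counts sessions for one group. Assumes the timestamps are already in
     * ascending order, so only the previous timestamp has to be remembered.
     */
    static int countSessions(Iterator<Long> sortedTimestamps) {
        int sessions = 0;
        long previous = 0L;
        boolean first = true;
        while (sortedTimestamps.hasNext()) {
            long ts = sortedTimestamps.next();
            if (first || ts - previous > GAP_MS) {
                sessions++; // first record, or a gap large enough to split
            }
            previous = ts;
            first = false;
        }
        return sessions;
    }

    public static void main(String[] args) {
        // Two records close together, a gap of about an hour, then two more.
        List<Long> timestamps = Arrays.asList(0L, 60_000L, 3_600_000L, 3_660_000L);
        System.out.println(countSessions(timestamps.iterator())); // prints 2
    }
}
```

Memory use here is constant per group. Without the sorted-order guarantee, the same function would first have to buffer and sort every record of the group, which is the approach that fails on very large groups.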



> allow a different comparator for grouping keys in calls to reduce
> -----------------------------------------------------------------
>
>                 Key: HADOOP-485
>                 URL: https://issues.apache.org/jira/browse/HADOOP-485
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: mapred
>    Affects Versions: 0.5.0
>            Reporter: Owen O'Malley
>         Assigned To: Tahir Hashmi
>         Attachments: Hadoop-485-pre.patch, TestUserValueGrouping.java.patch
>
>
> Some algorithms require that the values to the reduce be sorted in a particular order,
> but extending the key with the additional fields causes them to be handled by different calls
> to reduce. (The user then collects the values until they detect a "real" key change and then
> processes them.)
> It would be much easier if the framework let you define a second comparator that did
> the grouping of values for reduces. So your reduce inputs look like:
> A1, V1
> A2, V2
> A3, V3
> B1, V4
> B2, V5
> instead of getting calls to reduce that look like:
> reduce(A1, {V1}); reduce(A2, {V2}); reduce(A3, {V3}); reduce(B1, {V4}); reduce(B2, {V5});
> you could define the grouping comparator to just compare the letters and end up with:
> reduce(A1, {V1,V2,V3}); reduce(B1, {V4,V5});
> which is the desired outcome. Note that this assumes that the "extra" part of the key
> is just for sorting because the reduce will only see the first representative of each equivalence
> class.
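
The difference between the two comparators in the description above can be simulated in plain Java, without any Hadoop classes. This is only a sketch of the proposed behaviour, not the framework's implementation: the class GroupingSketch and the method reduceCalls are hypothetical names, and keys and values are plain strings. Pairs are sorted with one comparator, and a new reduce call starts whenever the second (grouping) comparator reports that the key differs from the first key of the current group.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class GroupingSketch {
    /**
     * Sorts (key, value) pairs with sortCmp, then splits the sorted run into
     * reduce calls using groupCmp. Each call is labelled with the first key
     * of its group, since that is the only key the reduce would see.
     */
    static List<String> reduceCalls(List<String[]> pairs,
                                    Comparator<String> sortCmp,
                                    Comparator<String> groupCmp) {
        List<String[]> sorted = new ArrayList<>(pairs);
        sorted.sort((a, b) -> sortCmp.compare(a[0], b[0]));

        List<String> calls = new ArrayList<>();
        String groupKey = null; // first key of the group being built
        StringBuilder values = new StringBuilder();
        for (String[] pair : sorted) {
            if (groupKey != null && groupCmp.compare(groupKey, pair[0]) != 0) {
                calls.add("reduce(" + groupKey + ", {" + values + "})");
                groupKey = null;
                values.setLength(0);
            }
            if (groupKey == null) {
                groupKey = pair[0];
            } else {
                values.append(",");
            }
            values.append(pair[1]);
        }
        if (groupKey != null) {
            calls.add("reduce(" + groupKey + ", {" + values + "})");
        }
        return calls;
    }

    public static void main(String[] args) {
        List<String[]> pairs = Arrays.asList(
            new String[] {"B2", "V5"}, new String[] {"A1", "V1"},
            new String[] {"A3", "V3"}, new String[] {"B1", "V4"},
            new String[] {"A2", "V2"});
        Comparator<String> fullKey = Comparator.naturalOrder();
        Comparator<String> letterOnly = Comparator.comparing((String k) -> k.charAt(0));

        // Grouping on the full key: one call per key (five calls).
        System.out.println(reduceCalls(pairs, fullKey, fullKey));
        // Grouping on the letter only:
        // [reduce(A1, {V1,V2,V3}), reduce(B1, {V4,V5})]
        System.out.println(reduceCalls(pairs, fullKey, letterOnly));
    }
}
```

The sort still uses the full key, so values inside each group keep the order imposed by the "extra" part of the key, while the grouping comparator only decides where one reduce call ends and the next begins.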

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

