hadoop-common-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Chris Douglas (JIRA)" <j...@apache.org>
Subject [jira] Updated: (HADOOP-5705) Improved tries in TotalOrderPartitioner to eliminate large leaf nodes.
Date Thu, 23 Apr 2009 20:59:30 GMT

     [ https://issues.apache.org/jira/browse/HADOOP-5705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

Chris Douglas updated HADOOP-5705:

       Resolution: Fixed
    Fix Version/s: 0.21.0
     Hadoop Flags: [Reviewed]
           Status: Resolved  (was: Patch Available)


I committed this. Thanks, Dick

> Improved tries in TotalOrderPartitioner to eliminate large leaf nodes.
> ----------------------------------------------------------------------
>                 Key: HADOOP-5705
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5705
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.19.1
>            Reporter: Dick King
>            Assignee: Dick King
>             Fix For: 0.21.0
>         Attachments: hadoop-5705-no-names.patch, hadoop-5705.patch
> With the old technology, if a particular node in the trie has many children that contain
no split points, TotalOrderPartitioner creates a separate empty leaf node for each one.  This
takes a lot of space, which in turn limits the depth to which we can grow these trees to avoid
making too many sparse nodes, so there is a parameter that defaults to 2 to control the depth.
 With this patch, I can guarantee that each split point will create only three trie nodes
in the trie: an empty one, a leaf that contains only one split point, and the internal nodes
as needed, for a total space of perhaps 50-200 bytes per split point, even if the entire trie
is elaborated.  There are pathological cases that can cause a recursion overflow during creation,
so i have set the default tree max depth to 200, but I expect almost all tries to be fully
elaborated, which means in turn that each byte of the sought key gets touched at most twice.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

View raw message