mahout-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Peng Cheng (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration
Date Tue, 23 Jul 2013 00:26:50 GMT

    [ https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13715906#comment-13715906
] 

Peng Cheng edited comment on MAHOUT-1286 at 7/23/13 12:26 AM:
--------------------------------------------------------------

In another hand, I try to solve the problem by implementing FastByIDArrayMap, a slightly more
compact Map implementation than FastByIDMap, it uses binary search to arrange all entries
into a tight array, so its worst-case time complexity for get, put and delete is log( n )
(much slower than double hashing's average O(1)). But has a (marginally) smaller memory footprint
and faster iteration. It has no problem passing all unit tests. But its real performance can
only be shown when embedded in FileDataModel. I'll post the result shortly.

However, I don't feel this is the right direction. If Sean Owen did everything right in his
FastByIDMap, then reducing memory footage to 0.66 times doesn't worth the speed loss.
                
      was (Author: peng):
    In another hand, I try to solve the problem by implementing FastByIDArrayMap, a slightly
more compact Map implementation than FastByIDMap, it uses binary search to arrange all entries
into a tight array, so its worst-case time complexity for get, put and delete is log(n) (much
slower than double hashing's average O(1)). But has a (marginally) smaller memory footprint
and faster iteration. It has no problem passing all unit tests. But its real performance can
only be shown when embedded in FileDataModel. I'll post the result shortly.

However, I don't feel this is the right direction. If Sean Owen did everything right in his
FastByIDMap, then reducing memory footage to 0.66 times doesn't worth the speed loss.
                  
> Memory-efficient DataModel, supporting fast online updates and element-wise iteration
> -------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-1286
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1286
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Collaborative Filtering
>    Affects Versions: 0.9
>            Reporter: Peng Cheng
>            Assignee: Sean Owen
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Most DataModel implementation in current CF component use hash map to enable fast 2d
indexing and update. This is not memory-efficient for big data set. e.g. Netflix prize dataset
takes 11G heap space as a FileDataModel.
> Improved implementation of DataModel should use more compact data structure (like arrays),
this can trade a little of time complexity in 2d indexing for vast improvement in memory efficiency.
In addition, any online recommender or online-to-batch converted recommender will not be affected
by this in training process.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Mime
View raw message