mahout-dev mailing list archives

From "Grant Ingersoll (JIRA)" <j...@apache.org>
Subject [jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains
Date Tue, 17 Nov 2009 20:34:39 GMT

    [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779111#action_12779111 ]

Grant Ingersoll commented on MAHOUT-165:
----------------------------------------

bq. So I found Wolfgang Hoschek, the author of Colt, and he confirms that it is no longer
maintained, and wishes us the best of luck in taking it over for ourselves if we so desired.

I seem to recall him being a Lucene contributor in the past.  Perhaps he would be willing
to donate Colt to Apache?  I don't think we can just bring in its source and claim it as
ours.  Another option is to see whether he would move it over to Google Code and make some
of us committers on the project.  Perhaps Commons Math is interested in it, too.



> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.3
>
>         Attachments: colt.jar, mahout-165-trove.patch, MAHOUT-165-updated.patch, mahout-165.patch, MAHOUT-165.patch, mahout-165.patch
>
>
> In SparseVector, we need a primitive hash map for indices and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects.
>
> In an experiment, I found that, for get/set operations, Colt's primitive hash map performs an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though.
> Using Colt in SparseVector improved the performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt reduces this to 19-20 minutes, roughly a 60% reduction in run time.
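
For illustration only, the kind of primitive hash map the description refers to could be sketched as below. This is not Colt's or Mahout's actual code: the class name IntDoubleHashMap, its methods, the -1 sentinel, and the hash constant are all hypothetical choices made for this sketch. It assumes non-negative vector indices, has no removal, and treats a missing index as 0.0, as a sparse vector would.

```java
import java.util.Arrays;

/** Hypothetical sketch: open-addressing int -> double map with linear probing. */
final class IntDoubleHashMap {
    // Assumption: vector indices are non-negative, so -1 can mark an unused slot.
    private static final int FREE = -1;

    private int[] keys;       // primitive arrays avoid boxing Integer/Double
    private double[] values;
    private int size;

    IntDoubleHashMap(int expectedSize) {
        int capacity = 8;
        while (capacity < expectedSize * 2) {
            capacity <<= 1;   // power-of-two capacity, load factor at most 0.5
        }
        keys = new int[capacity];
        values = new double[capacity];
        Arrays.fill(keys, FREE);
    }

    /** Returns the slot holding the key, or the free slot where it would go. */
    private int slot(int key) {
        int mask = keys.length - 1;
        int mix = key * 0x9E3779B9;            // multiplicative scramble of the key
        int i = (mix ^ (mix >>> 16)) & mask;
        while (keys[i] != FREE && keys[i] != key) {
            i = (i + 1) & mask;                // linear probing
        }
        return i;
    }

    /** Value stored at index, or 0.0 when the index is absent. */
    double get(int index) {
        if (index < 0) {
            return 0.0;
        }
        int i = slot(index);
        return keys[i] == index ? values[i] : 0.0;
    }

    void set(int index, double value) {
        if (index < 0) {
            throw new IllegalArgumentException("index must be non-negative");
        }
        int i = slot(index);
        if (keys[i] == FREE) {
            if ((size + 1) * 2 > keys.length) { // keep the table at most half full
                resize();
                i = slot(index);
            }
            keys[i] = index;
            size++;
        }
        values[i] = value;
    }

    private void resize() {
        int[] oldKeys = keys;
        double[] oldValues = values;
        keys = new int[oldKeys.length * 2];
        values = new double[oldKeys.length * 2];
        Arrays.fill(keys, FREE);
        for (int j = 0; j < oldKeys.length; j++) {
            if (oldKeys[j] != FREE) {
                int i = slot(oldKeys[j]);
                keys[i] = oldKeys[j];
                values[i] = oldValues[j];
            }
        }
    }

    int size() {
        return size;
    }
}
```

With a layout like this, get and set cost an expected constant number of probes. Iteration has to scan the whole table, free slots included, and visits indices in no particular order, which may be one reason iteration compares less favourably with an ordered mapping.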

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

