mahout-dev mailing list archives

From "Ted Dunning (JIRA)" <>
Subject [jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains
Date Wed, 30 Sep 2009 22:07:23 GMT


Ted Dunning commented on MAHOUT-165:

Thanks Jake, that could be very helpful.

The throwing of "Impossible confusion" is done in situations where an impossible condition
has been detected.  For instance, since hash tables are resized when they become partially
filled, it should be impossible for the search loop to exit without finding an empty cell
or a match.  When programming, I have difficulty pronouncing "should" so I try to detect the
situation and signal it with an unchecked exception.  I usually define something like "ImpossibleConditionException",
but didn't in this case.  I use an unchecked exception because it is clear that the application
is not going to be able to recover from a situation that I don't think could occur.
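
A minimal sketch of the pattern described above, not the code from the patch. The class name IntDoubleProbeMap, the linear probing scheme, the 0.5 load factor and the use of IllegalStateException are all assumptions made for illustration. The point is only that the probe loop ends in an unchecked exception instead of silently falling through.

```java
import java.util.Arrays;

/** Hypothetical open-addressing int-to-double map; not Mahout's actual class. */
final class IntDoubleProbeMap {
    private static final int EMPTY = -1;        // assumption: keys are non-negative
    private static final double MAX_LOAD = 0.5; // assumption: resize when half full

    private int[] keys = new int[8];
    private double[] values = new double[8];
    private int size;

    IntDoubleProbeMap() {
        Arrays.fill(keys, EMPTY);
    }

    /** Returns the slot holding key, or the empty slot where it would be stored. */
    private int find(int key) {
        if (key < 0) {
            throw new IllegalArgumentException("negative key: " + key);
        }
        int n = keys.length;
        int start = key % n;
        for (int i = 0; i < n; i++) {
            int slot = (start + i) % n;
            if (keys[slot] == key || keys[slot] == EMPTY) {
                return slot;
            }
        }
        // Resizing keeps the table at most about half full, so the loop above
        // should always return. If it does not, fail loudly.
        throw new IllegalStateException(
            "Impossible confusion: no empty cell or match for key " + key);
    }

    double get(int key) {
        int slot = find(key);
        return keys[slot] == key ? values[slot] : 0.0;
    }

    void set(int key, double value) {
        int slot = find(key);
        if (keys[slot] == EMPTY) {
            keys[slot] = key;
            size++;
        }
        values[slot] = value;
        if (size > keys.length * MAX_LOAD) {
            resize();
        }
    }

    /** Doubles the capacity and reinserts every stored entry. */
    private void resize() {
        int[] oldKeys = keys;
        double[] oldValues = values;
        keys = new int[oldKeys.length * 2];
        values = new double[keys.length];
        Arrays.fill(keys, EMPTY);
        size = 0;
        for (int i = 0; i < oldKeys.length; i++) {
            if (oldKeys[i] != EMPTY) {
                set(oldKeys[i], oldValues[i]);
            }
        }
    }

    int size() {
        return size;
    }
}
```

Callers never need a try/catch for the exception. If it ever fires, the resizing invariant has been broken and the stack trace points straight at the probe loop.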

I left the hard-coding of one option or the other in place because I could see my patch extending
into everything everywhere and wanted to limit the scope of the change.  You are right that
we need to think about how that works.  In most cases, I think that hard-coding is fine just
like hard-coding the use of an ArrayList in some application is not subject to user override.
 There are a few cases where this isn't true, but I think that usually means that the
vector or matrix should be passed in.  The use of like() may also be indicated.
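
To make the like() idea concrete, here is a simplified sketch. The Vec interface and the DenseVec, SparseVec and VecOps classes are hypothetical stand-ins, not Mahout's own Vector classes. The assumption is only that like() returns a new, empty vector of the same implementation and size as the receiver.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical minimal vector interface, for illustration only. */
interface Vec {
    int size();
    double get(int i);
    void set(int i, double v);
    /** Returns an empty vector of the same implementation and size. */
    Vec like();
}

final class DenseVec implements Vec {
    private final double[] data;
    DenseVec(int size) { data = new double[size]; }
    public int size() { return data.length; }
    public double get(int i) { return data[i]; }
    public void set(int i, double v) { data[i] = v; }
    public Vec like() { return new DenseVec(data.length); }
}

final class SparseVec implements Vec {
    private final int size;
    private final Map<Integer, Double> cells = new HashMap<>();
    SparseVec(int size) { this.size = size; }
    public int size() { return size; }
    public double get(int i) { return cells.getOrDefault(i, 0.0); }
    public void set(int i, double v) {
        if (v == 0.0) {
            cells.remove(i);   // zeros are not stored
        } else {
            cells.put(i, v);
        }
    }
    public Vec like() { return new SparseVec(size); }
}

final class VecOps {
    /** Scales v without hard-coding which implementation the result uses. */
    static Vec scale(Vec v, double factor) {
        Vec result = v.like();   // the caller's choice of implementation carries over
        for (int i = 0; i < v.size(); i++) {
            double x = v.get(i);
            if (x != 0.0) {
                result.set(i, x * factor);
            }
        }
        return result;
    }
}
```

With this shape, the caller who passes in a SparseVec gets a SparseVec back and the caller who passes in a DenseVec gets a DenseVec back, so the library code needs no hard-coded choice at that point.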

> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>                 Key: MAHOUT-165
>                 URL:
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.2
>         Attachments: colt.jar, mahout-165-trove.patch, MAHOUT-165-updated.patch, MAHOUT-165.patch,
>
> In SparseVector, we need a primitive hash map for indices and values. The present implementation
> of this hash map is not as efficient as some of the other implementations in non-Apache projects.
>
> In an experiment, I found that, for get/set operations, the primitive hash of Colt performs
> an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower.
>
> Using Colt in SparseVector improved performance of canopy generation. For an experimental
> dataset, the current implementation takes 50 minutes. Using Colt reduces this duration to
> 19-20 minutes. That's a 60% reduction in running time.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
