mahout-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Eugene Kirpichov (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAHOUT-823) RandomAccessSparseVector.dot with another non-sequential vector can be extremely non-symmetric in its performance
Date Fri, 30 Sep 2011 13:26:45 GMT

    [ https://issues.apache.org/jira/browse/MAHOUT-823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13118049#comment-13118049
] 

Eugene Kirpichov commented on MAHOUT-823:
-----------------------------------------

Ted, since the code change required is so small, I simply included it into the task description.
Is there any particular reason to convert that into a patch (e.g. the necessity to follow
a formal procedure)?

Sean, right, the goal is to avoid doing many lookups; I actually only had in mind the trivial
case of dotting two RandomAccess vectors, but your comment got me thinking on what to do in
the general case.

Random-random: Smallest leads (what's proposed)
Seq-seq: Merge (what's implemented)
Random-seq:
If random leads, number of operations is O((num-nonzero-in-r)^2) because there's this many
lookups into the sequential vector, each taking linear time.
If sequential leads (what's implemented), number of operations is O(num nonzero in sequential).

It's difficult to decide which to use, given that we don't know the constant factors, but
at least having the sequential lead will never have quadratic behavior, so I suggest to leave
it as is.

---
Bottom line:
I would suggest to implement just the proposed change.
                
> RandomAccessSparseVector.dot with another non-sequential vector can be extremely non-symmetric
in its performance
> -----------------------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-823
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-823
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.5
>            Reporter: Eugene Kirpichov
>            Assignee: Sean Owen
>              Labels: dot, dot-product, vector
>             Fix For: 0.6
>
>         Attachments: MAHOUT-823.patch
>
>
> http://codesearch.google.com/#6LK_nEANBKE/math/src/main/java/org/apache/mahout/math/RandomAccessSparseVector.java&l=172
> The complexity of the algorithm is O(num nondefault elements in this), while it could
clearly be O(min(num nondefault in this, num nondefault in x)).
> This can be fixed by adding this code before line 189.
> {code}
> if(x.getNumNondefaultElements() < this.getNumNondefaultElements()) {
>   return x.dot(this);
> }
> {code}
> An easy case where this asymmetry is very apparent and makes a huge difference in performance
is K-Means clustering.
> In K-Means for high-dimensional points (e.g. those that arise in text retrieval problems),
the centroids often have a huge number of non-zero components, whereas points have a small
number of them.
> So, if you make a mistake and use centroid.dot(point) in your code for computing the
distance, instead of point.dot(centroid), you end up with orders of magnitude worse performance
(which is what we actually observed - the clustering time was a couple of minutes with this
fix and over an hour without it).
> So, perhaps, if you make this fix, quite a few people who had a similar case but didn't
notice it will suddenly have a dramatic performance increase :)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Mime
View raw message