mahout-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jake Mannix (JIRA)" <j...@apache.org>
Subject [jira] Commented: (MAHOUT-201) OrderedIntDoubleMapping / SparseVector is unnecessarily slow
Date Wed, 18 Nov 2009 18:42:39 GMT

    [ https://issues.apache.org/jira/browse/MAHOUT-201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779588#action_12779588
] 

Jake Mannix commented on MAHOUT-201:
------------------------------------

The reason I started looking at Mahout's old OrderedIntDoubleMapping was that I didn't see
any implementation in Colt which was actually just a wrapper around ordered int[]'s and their
corresponding double[] values.  Colt has *unordered* int/double mappings (which are hash map
impls), which a) doesn't have order information, and b) uses a hash, so isn't either as efficient
for iteration, or for storage).

Once mahout-colt is committed in MAHOUT-165, then yes, we can juice-up a Colt implementation
(I haven't pulled down your patch yet, Shashi, I'll check it out), but from what I can see,
they have lots of fast map implementations, and as far as extending their DoubleMatrix1D class
(i.e. Vector), the SparseDoubleMatrix1D is map-based as well, so there is no equivalent of
just an int[], double[] pair implementing Vector.

> OrderedIntDoubleMapping / SparseVector is unnecessarily slow
> ------------------------------------------------------------
>
>                 Key: MAHOUT-201
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-201
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Jake Mannix
>             Fix For: 0.3
>
>         Attachments: MAHOUT-201.patch
>
>
> In the work on MAHOUT-165, I find that while Colt's sparse vector implementation is great
from a hashing standpoint (it's memory efficient and fast for random-access), they don't provide
anything like the OrderedIntDoublePair - i.e. a vector implementation which is *not* fast
for random access, or out-of-order modification, but is minimally sized memory-wise and blazingly
fast for doing read-only dot-products and vector sums (where the latter is read-only on inputs,
and is creating new output) with each other, and with DenseVectors.
> This line of thinking got me looking back at the current SparseVector implementation
we have in Mahout, because it *is* based on an int[] and a double[].  Unfortunately, it's
not at all optimized for the cases where it can outperform all other sparse impls:
> * it should override dot(Vector) and plus(Vector) to check whether the input is a DenseVector
or a SparseVector (or, once we have an OpenIntDoubleMap implementation of SparseVector, that
case as well), and do specialized operations here.
> * even when those particular methods aren't being used, the AllIterator and NonZeroIterator
inner classes are very inefficient:
> ** minor things like caching the values.numMappings() and values.getIndices in final
instance variables in the Iterators
> ** the huge performance drain of Element.get() : {code} public double get() {  return
values.get(ind);  } {code}, which is implemented as a binary search on index values array
(the index of which was already known!) followed by the array lookup
> This last point is probably the entire reason why we've seen performance problems with
the SparseVector, as it's in both the NonZeroIterator and the AllIterator, and so turned any
O(numNonZeroElements) operations into O(numNonZeroElements * log(numNonZeroElements)) (with
some additional constant factors for too much indirection thrown in for good measure).
> Unless there is another JIRA ticket which has a patch fixing this which I didn't notice,
I can whip up a patch (I've got a similar implementation over in decomposer I can pull stuff
out of, although mine is simpler because it is immutable, so it's not just a copy and paste).
> We don't have any benchmarking code anywhere yet, do we?  Is there a JIRA ticket open
for that already?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Mime
View raw message