mahout-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jake Mannix <jake.man...@gmail.com>
Subject Re: Horrible performance with SequentialAccessSparseVector
Date Thu, 11 Apr 2013 04:29:20 GMT
> In the existing code, assign() comes from AbstractVector and if the
> function is not PLUS or PLUS_ABS, it does this:
>
> for (int i = 0; i < size; i++) {
>  setQuick(i, function.apply(getQuick(i), other.getQuick(i)));
> }

Yeah, this has been a nasty nasty fact forever, and I should read your
patch more carefully, it looks like you're doing the "if the function
preserves zero, apply it sparsely" trick, which if done right, is great.




On Wed, Apr 10, 2013 at 2:56 PM, Dan Filimon <dangeorge.filimon@gmail.com>wrote:

> Good news and bad news!
>
> After weeks of searching, I finally found the bug that made my code run so
> slowly.
> I'm using SequenceFileDirIterable to read the centroids from disk and it
> looks like it's creating SequentialAccessSparseVectors.
>
> Then, when assigning each point to its appropriate vector, we update the
> centroid itself using assign() and a custom function that does a weighted
> sum.
> Herein lies the problem.
>
> In the existing code, assign() comes from AbstractVector and if the
> function is not PLUS or PLUS_ABS, it does this:
>
> for (int i = 0; i < size; i++) {
>   setQuick(i, function.apply(getQuick(i), other.getQuick(i)));
> }
>
> If "this" is a SequentialAccessSparseVector, getting is done in O(log size)
> steps as is setting (except when a new element needs to be inserted which
> can cause the two arrays in OrderedIntDoubleMapping to grow, which means a
> copy of the vector has to be made).
>
> Even assuming copying is not an issue (although it really is, the centroids
> grow more dense as more points are added), that is O(size log size) for one
> assignment where size is the dimension of the vector (NOT the number of
> mappings).
>
> In my case with vectors with 200K dimensions out of which at best 100 are
> non-zero this was VERY painful.
>
> Anyway, here's the link to an experimental patch [1] (it adds NO more
> tests/benchmarks).
> And here's the link to the issue [2].
>
> What should we do?
>
> [1] https://reviews.apache.org/r/10409/
> [2] https://issues.apache.org/jira/browse/MAHOUT-1190#comment-13628314
>



-- 

  -jake

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message