mahout-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jake Mannix <>
Subject Re: [jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains
Date Thu, 01 Oct 2009 17:53:12 GMT
On Thu, Oct 1, 2009 at 10:10 AM, Ted Dunning <> wrote:

> Btw... the other think that the HashVector does better is inserts.  The
> sorted vector could do much better on average if it deferred sorting until
> an access or iteration was done.  Even iteration doesn't necessarily need
> sorting, but it could by seen as part of the contract.

Iteration doesn't *need* sorting, but I agree that it should be part of the
because doing fast dot products of two OrderedIntDoublePair instances really
needs ordering, so you can just walk them both in parallel.

In decomposer, I ended up having an ImmutableSparseVector subclass which
did the sorting in the constructor (if it wasn't already sorted) and then
allow further inserts/deletes/modifications.

Speaking of other little niceties: caching the norm is something I found
useful for a vector, and keeping track of a boolean "isDirty" which lets you

know whether you've invalidated it and will need to recalculate on the next
call to norm().

Of course, to do this right in Mahout, we'd probably need to have some way
telling the Vector instance what to use for its norm (so it knows whether to

cache it's L^2 norm, L^p norm, or some other inner product applied to

Which gets me thinking: maybe we should have an InnerProduct interface,
similar to DistanceMeasure, which defined how to compute dot().  As it is,
we basically assume in our API that while you may want norm(int p) for
various p, and you may want different DistanceMeasure impls of various
types, we say that dot() is always the Euclidean dot().

Should that be pluggable as well?

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message