mahout-dev mailing list archives

From Robin Anil <robin.a...@gmail.com>
Subject Re: Profiling SequentialAccessSparseVector
Date Thu, 18 Feb 2010 20:29:27 GMT
File it for 0.3?


Robin

On Fri, Feb 19, 2010 at 1:56 AM, Jake Mannix <jake.mannix@gmail.com> wrote:

> On Thu, Feb 18, 2010 at 11:55 AM, Robin Anil <robin.anil@gmail.com> wrote:
>
> > I was trying out SequentialAccessSparseVector on Canopy Clustering using
> > Manhattan distance and found performance to be really bad, so I profiled
> > it with YourKit (thanks a lot for providing us a free license).
> >
> > Since I was using Manhattan distance, there were a lot of A-B operations,
> > each of which cloned a vector: 5% of the total time. There were also many
> > A+B operations for adding a point to the canopy to compute the average,
> > and these cloned as well: 90% of the total time.
> >
>
> SequentialAccessSparseVector should only be used in a read-only fashion.
> If you are creating an average centroid which is sparse but mutating,
> then it should be a RandomAccessSparseVector. The points used to create
> it can be SequentialAccessSparseVector (if they themselves never change),
> but then the method called should be
> SequentialAccessSparseVector.addTo(RandomAccessSparseVector). This
> exploits the fast sequential iteration of SeqAcc and the fast
> random-access mutability of RandAcc.
>
>
> >
> > So we definitely need to improve that.
> >
> > As a small hack, I made the cluster centers RandomAccess vectors, and
> > things are fast again. I don't know whether to commit it or not. Something
> > to look into for 0.4?
> >
>
> Yeah, cluster *centers* should indeed be RandomAccess.  JIRA / patch so we
> can see exactly what the change is?
>
>  -jake
>

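For readers following the thread, here is a minimal sketch of the pattern Jake describes: read-only points stored for sequential iteration are added in place into a mutable, hash-backed centroid, so no vector is cloned per addition. The classes `SeqSparse`, `RandSparse`, and `CentroidSketch` are hypothetical stand-ins written for illustration. They are not Mahout's SequentialAccessSparseVector or RandomAccessSparseVector, and the `addTo` shown here only assumes the behaviour described in the message above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a read-only, sequential-access sparse vector.
// Indices and values live in parallel arrays, so iteration is a linear scan.
final class SeqSparse {
    private final int[] indices;
    private final double[] values;

    SeqSparse(int[] indices, double[] values) {
        if (indices.length != values.length) {
            throw new IllegalArgumentException("indices and values must have equal length");
        }
        this.indices = indices.clone();
        this.values = values.clone();
    }

    // Adds every non-zero entry of this vector into the target, in place.
    // Neither operand is copied.
    void addTo(RandSparse target) {
        for (int i = 0; i < indices.length; i++) {
            target.add(indices[i], values[i]);
        }
    }
}

// Hypothetical stand-in for a mutable, random-access sparse vector.
// A hash map gives cheap updates at arbitrary indices.
final class RandSparse {
    private final Map<Integer, Double> entries = new HashMap<>();

    void add(int index, double delta) {
        entries.merge(index, delta, Double::sum);
    }

    double get(int index) {
        return entries.getOrDefault(index, 0.0);
    }

    void scale(double factor) {
        entries.replaceAll((index, value) -> value * factor);
    }
}

final class CentroidSketch {
    // Averages the points into one mutable accumulator.
    static RandSparse centroid(SeqSparse[] points) {
        RandSparse sum = new RandSparse();
        if (points.length == 0) {
            return sum; // nothing to average
        }
        for (SeqSparse point : points) {
            point.addTo(sum); // sequential read, random-access write
        }
        sum.scale(1.0 / points.length);
        return sum;
    }

    public static void main(String[] args) {
        SeqSparse a = new SeqSparse(new int[] {0, 3}, new double[] {1.0, 2.0});
        SeqSparse b = new SeqSparse(new int[] {3, 5}, new double[] {4.0, 6.0});
        RandSparse c = centroid(new SeqSparse[] {a, b});
        System.out.println(c.get(0) + " " + c.get(3) + " " + c.get(5));
    }
}
```

The design point is that the accumulator is the only object mutated: each point is scanned once, and the cost of an addition is proportional to the number of non-zero entries in the point rather than to the cost of cloning a vector.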