About density based clustering(DBSCAN), since it should be supported a
random access for distance matrix, it has a scalability problem. So, to
overcome this limitation, I've chosen an approach using unionfind. As you
know, unionfind is scalable and can be suitable to MapReduce way. To apply
unionfind, we have to find initial clusters which are within MinPts and Eps
boundary and check two clusters are sharing some elements in pairwise way.
Armed with these paired clusters, we can apply unionfind to that.
This method result in a slightly different output comparing to DBSCAN,
because expanding cluster through unionfind does not regard boundary points
and core points. Anyway, this can be scalable.
I've implemented already based on Mahout. To vectorize documents, I used
Mahout seq2sparse module.
The disadvantage of above method is that we have to calculate O(N^2) module
twice. First is calculating pairwise similarity distance, second is checking
two clusters can be connected in a pairwise way. Although we are borrowing a
distributed computing power on Hadoop, O(N^2) is not tractable if we have
more one million documents. The worst thing is that checking two clusters
are sharing some elements can cost another linear time complexity. To boost
up this, I've used a kind of bitmap, but that's slow too :(
I thought I'd like to contribute to Mahout regarding density based
clustering. Unfortunately, I don't have much free time :(
Best, Jay
2011/3/10 Ted Dunning <ted.dunning@gmail.com>
> I don't know if iterative solutions are actually required. Such iterations
> are relatively natural, but the same can be said for power methods and
> there
> are dramatic improvements available with those.
>
> Alternative frameworks are an excellent idea. There is a LOT of stuff
> about
> to happen to make alternative frameworks possible in a cluster. One
> alternative platform is MESOS which was accepted as an incubator in
> December. Another is the MapReduce 2.0 stuff that Arun Murthy is working
> on
> at Yahoo but which is not yet released. Both allow mapreduce to
> ultimately
> be demoted from clustergodhood to the status of a library. Both allow
> alternative computational paradigms to be built on an experimental level.
> Neither is ready for primetime.
>
> RBM's would be much easier to code on graphlab, but right now graphlab
> isn't very deployable into production. Getting experience with RBM's is
> reasonable even without graphlab, however.
>
> On Thu, Mar 10, 2011 at 11:17 AM, <jeremy@lewi.us> wrote:
>
> > Ted in a different thread (I think it was in response to Bickson's
> > profiling of Hadoop), I think you pointed out that Hadoop isn't well
> suited
> > for iterative algorithms. Don't RBMs require iterative algorithms? Or is
> it
> > just when you stack them into a deep belief net that you need iterative
> > algorithms?
> >
> > Given the graphical/iterative nature of RBMs would it make more sense to
> > try to build a better framework then mapreduce; e.g scale Bickson's
> > graphlab to the cloud or an opensource equivalent of Google's pregel.
> >
> > There's already some pretty good codes out there for RBMs in addition to
> > GraphLab; there's Theano which leverages gpu. Would there be some way to
> > leverage those codes?
> >
> >
> > J
> >
> >
> > Quoting Ted Dunning <ted.dunning@gmail.com>:
> >
> > RBM's would be very cool. They have some potential for being at least
> as
> >> scalable as LDA.
> >>
> >> How would you go about it?
> >>
> >> On Thu, Mar 10, 2011 at 10:39 AM, Dan Brickley <danbri@danbri.org>
> wrote:
> >>
> >> On 10 March 2011 18:32, Ted Dunning <ted.dunning@gmail.com> wrote:
> >>>
> >>> > Out of all of this, you might get the impression that I don't see
> much
> >>> to
> >>> > add to Mahout and you would be right insofar as this list is
> concerned.
> >>> That
> >>> > doesn't mean that other people won't have other itches. If they do,
> >>> they
> >>> > should drop in on the Mahout mailing lists and see if we can't work
> >>> together
> >>> > to scratch that itch.
> >>>
> >>> Is there any interest in reviving the RBM work?
> >>> https://issues.apache.org/jira/browse/MAHOUT214
> >>>
> >>>
> >>>
> http://en.wikipedia.org/wiki/Boltzmann_machine#Restricted_Boltzmann_Machine
> >>>
> >>>
> >>>
> http://www.scholarpedia.org/article/Boltzmann_machine#Restricted_Boltzmann_machines
> >>>
> >>> Hinton in http://www.youtube.com/watch?v=AyzOUbkUf3M is both
> >>> entertaining and persuasive, but I'm out of my maths depth.
> >>> Also http://www.cs.toronto.edu/~hinton/ >
> >>> http://www.cs.toronto.edu/~hinton/science.pdf
> >>>
> >>> cheers,
> >>>
> >>> Dan
> >>>
> >>>
> >>
> >
> >
> >
>
