mahout-user mailing list archives

From "Bae, Jae Hyeon" <>
Subject Re: DBScan in Mahout
Date Wed, 01 Feb 2012 03:11:00 GMT

I asked the same question before, so I may be a good person to answer this
one. I have also implemented DBSCAN myself.

You can cluster by similarity with spectral clustering, because spectral
clustering is based on similarity values. The strength of DBSCAN, however,
is that we don't need to specify the number of clusters and we don't need
prior information about the data distribution, an advantage that spectral
clustering does not share.

DBSCAN needs the distance matrix for the set of elements. If your dataset
is high-dimensional, the computational cost of building the distance
matrix is O(N^2), because every pair of elements has to be compared.
Otherwise, you can use a spatial index such as a KD-tree to reduce the
cost of building it. Once you have the distance matrix, the expand-cluster
step of the DBSCAN algorithm is easy to implement.

You can use Hadoop MapReduce to build the distance matrix. Also, if the
distance matrix is too big to be loaded into memory, or you don't want to
use either a disk-based search structure or another remote static
resource, you can replace the expand-cluster step of DBSCAN with a
union-find algorithm. Implementing a union-find algorithm with MapReduce
is not so
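
One possible reading of the union-find idea is sketched below, again in
plain single-machine Python rather than MapReduce. It assumes the close
pairs (distance <= eps) can be streamed twice: the first pass counts
neighbors to find core points, the second pass merges core points that are
close to each other. No random access to the distance matrix is needed.
The names UnionFind, close_pairs and dbscan_union_find are hypothetical,
and close_pairs stands in for whatever job emits the pairs.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+

class UnionFind:
    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            # Attach the larger root to the smaller one, so the root of a
            # set is always its smallest member index.
            self.parent[max(ra, rb)] = min(ra, rb)

def close_pairs(points, eps):
    """Stand-in for a job that streams every pair (i, j) within eps."""
    for i, j in combinations(range(len(points)), 2):
        if dist(points[i], points[j]) <= eps:
            yield i, j

def dbscan_union_find(points, eps, min_pts):
    n = len(points)
    # Pass 1: count neighbors per point (the point itself included).
    degree = [1] * n
    for i, j in close_pairs(points, eps):
        degree[i] += 1
        degree[j] += 1
    core = [d >= min_pts for d in degree]
    # Pass 2: merge close core points; remember one core per border point.
    uf = UnionFind(n)
    border_of = {}
    for i, j in close_pairs(points, eps):
        if core[i] and core[j]:
            uf.union(i, j)
        elif core[i]:
            border_of.setdefault(j, i)
        elif core[j]:
            border_of.setdefault(i, j)
    # Label = smallest core index in the cluster, or -1 for noise.
    labels = []
    for p in range(n):
        if core[p]:
            labels.append(uf.find(p))
        elif p in border_of:
            labels.append(uf.find(border_of[p]))
        else:
            labels.append(-1)
    return labels

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
uf_labels = dbscan_union_find(points, eps=1.5, min_pts=3)
```

In this sketch a border point is attached to the first core neighbor seen
in the stream, so border assignments can depend on the order of the pairs,
as they can in ordinary DBSCAN.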

Best, Jae

On Tue, Jan 31, 2012 at 7:34 PM, Vikas Pandya <> wrote:

> Hello,
> Does anybody know if there are any plans of including DBScan in Mahout?
> for that matter of fact, any density based algorithms in mahout?
