[ https://issues.apache.org/jira/browse/MAHOUT1991?page=com.atlassian.jira.plugin.system.issuetabpanels:commenttabpanel&focusedCommentId=16147397#comment16147397
]
ASF GitHub Bot commented on MAHOUT1991:

Github user rawkintrevo commented on a diff in the pull request:
https://github.com/apache/mahout/pull/334#discussion_r136095747
 Diff: website/docs/algorithms/clustering/dbscan/index.md 
@@ 0,0 +1,101 @@
+
+layout: algorithm
+title: DBSCAN
+theme:
+ name: retromahout
+
+
+### About
+
+[DBSCAN Clustering](https://www.aaai.org/Papers/KDD/1996/KDD96037.pdf)
+ is a simple and efficient density based clustering algorithm for group objects
+ into clusters. DBSCAN differs from other clustering algorithms in the way it
+ clusters objects. DBSCAN uses the density information of points to define clusters
+ Each object is represented as a point in the multidimensional feature space. The algorithm
+ uses the notion of core points, border points and noise points to perform the clustering.
+
+ Parameters of the Algorithm: (Epsilon, Min points)
+ * Epsilon (Eps)  denotes the radius of the sphere that defines the neighbourhood of
a given data point
+ * Min points (Minpts)  the minimum threshold for the number of points in a point p's
Epsneighbourhood to
+ make p a core point
+
+ Basic Definitions
+ * Epsneighbourhood of a point  The Epsneighbourhood of a point 'p' is defined as
the set of points S
+ such that for each x 'belongs to' S, distance(p,x) <= eps
+ * Core point  A point 'p' is called a core point if the number of points in its Epsneighbourhood
is >= Minpts
+ * Border point  A point that is not a core point but such that there exists a 'p' in
its neighbourhood
+ such that p is a core point
+ * Noise point  Any point that does not classify as a core point or border point is
a noise point. Noise points
+ do not belong to any cluster.
+
+ Algorithm:
+ The DBSCAN algorithm takes as parameters Eps and MinPts. Eps restricts the neighbourhood
of a point and MinPts
+ denotes the threshold for the number of neighbours in Epsneighbours of point p, for
it to form a cluster.
+ The algorithm randomly chooses a point q, and finds it Epsneighbourhood. If the number
of points in the
+ Epsneighbourhood of q is less than MinPts, it is marked as a noise point. Otherwise
it is marked as a core point
+ and a new cluster is created. If p is a core point, iteratively a Epsneighbourhood
query is performed on each of
+ its neighbours and points are added to the created cluster. If no unvisited points
can be added to cluster, the new
+ cluster is complete and no points will be added to the cluster in subsequent iterations.
Then another unvisited point q'
+ is picked up and the same process is repeated. The algorithm terminates when all the
points have been visited
+ i.e. when some of the points are added to clusters and some are marked as noise points.
The total number of
+ Epsneighbourhood queries performed is equal to the size of the data (n) and if no indexing
data structure is
+ used then the calculation of Epsneighbourhood query involved the distance calculation
with all the points (n).
+ Thus, if no indexing data structure is used, the complexity of the DBSCAN algorithm
is O(n^2)
+
+#### Strategy for parallelization
+
+The DBSCAN algorithm is an inherently sequential algorithm. A quick look at the pseudo
code provided in the original
+paper will show why.
+
+The parallelization strategy for DBSCAN involves 3 phases.
+* Data distribution phase preserving spatial locality
+* InCore DBSCAN Clustering
+* Merging InCore clusters to form Global clusters
+
+Few methods to parallelize DBSCAN have been proposed in the recent past and can be found
[here](http://delivery.acm.org/10.1145/2390000/2389081/a62patwary.pdf?ip=14.139.128.15&id=2389081&acc=ACTIVE%20SERVICE&key=045416EF4DDA69D9%2EDB7584019D0D7099%2E4D4702B0C3E38B35%2E4D4702B0C3E38B35&CFID=973347130&CFTOKEN=86904048&__acm__=1503663674_474f7b3eaa352d59c2905c1f530907c1)
and [here](http://delivery.acm.org/10.1145/2840000/2834894/a2gotz.pdf?ip=14.139.128.15&id=2834894&acc=ACTIVE%20SERVICE&key=045416EF4DDA69D9%2EDB7584019D0D7099%2E4D4702B0C3E38B35%2E4D4702B0C3E38B35&CFID=973347130&CFTOKEN=86904048&__acm__=1503663710_9001c4bdf3d4dc3ed6658d8a6aaf1190)
+
+In order to minimize the communication costs for large datasets, we provide an approximate
DBSCAN algorithm. Data division among nodes is done randomly
+violating the spatial locality condition. Clustering is performed InCore and cluster
merging is done by merging two clusters
+that are nearby and that contain significant number of points.
+
+### Parameters
+<div class="tablestriped">
+ <table class="table">
+ <tr>
+ <th>Parameter</th>
+ <th>Description</th>
+ <th>Default Value</th>
+ </tr>
+ <tr>
+ <td><code>'distanceMeasure</code></td>
+ <td>The metric used for calculating distance, see <a href="../distancemetrics.html">Distance
Metrics</a></td>
+ <td><code>'Euclidean</code></td>
+ </tr>
+ <tr>
+ <td><code>'Eps</code></td>
+ <td>The radius that defines the neighbourhood of a point</code></td>
+ <td>NA</td>
+ </tr>
+ <tr>
+ <td><code>'Minpts</code></td>
+ <td>The threshold for number of points in the neighbourhood of another
point</code></td>
+ <td>NA</td>
+ </tr>
+ </tr>
+ </table>
+</div>
+
+
+### Example
+
+ //Using the InCoreDBSCAN module
+ val input = dense((1.0, 1.2, 1.3, 1.4), (1.1, 1.5, 2.5, 1.0), (6.0, 5.2, 5.2, 5.3),
(7.0,6.0, 5.0, 5.0), (10.0, 1.0, 20.0, 10.0))
+
+ import org.apache.mahout.math.algorithms.clustering.InCoreDBSCAN
+ val distanceMetric = DistanceMetricSelector.select(DistanceMetricSelector.namedMetricLookup('Euclidean)))
 End diff 
+1 utilizing existing distanceMetrics framework.
> Implement naive DBSCAN Algorithm  O(n^2) complexity
> 
>
> Key: MAHOUT1991
> URL: https://issues.apache.org/jira/browse/MAHOUT1991
> Project: Mahout
> Issue Type: New Feature
> Components: Algorithms
> Reporter: Aditya AS
> Assignee: Aditya AS
>
> Implement the naive DBSCAN algorithm in Mahout Samsara, as part of the Algorithms Framework.

This message was sent by Atlassian JIRA
(v6.4.14#64029)
