mahout-dev mailing list archives

From "Drew Farris (JIRA)" <>
Subject [jira] Commented: (MAHOUT-344) Minhash based clustering
Date Wed, 31 Mar 2010 01:30:27 GMT


Drew Farris commented on MAHOUT-344:

Hi Cristi,

Sounds like a great start. Answers for a couple of your questions:

Is there a standard input format for each clustering algorithm, or does the input follow
the same rules for all algorithms, with users writing their own conversion tools?

Take a look at the various Vector classes in the math module and the VectorWritable wrapper.
Most of the clustering algorithms take vectors of one kind or another as input, and the assumption
is that users will write tools to convert their data to these common formats. The wiki page is a good place to start.
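
As a rough illustration of what such a conversion tool might do, here is a minimal sketch that parses one text record into sparse index/value pairs. The "index:value" line format, the class name SparseVectorParser, and the use of a plain sorted map are all assumptions made for this example; they are not part of Mahout. In a real job the parsed entries would be copied into one of the Vector classes and wrapped in a VectorWritable.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not a Mahout class: turns a text record into
// sparse (index -> value) entries that could later be copied into a Vector.
public class SparseVectorParser {

    /** Parses whitespace-separated "index:value" tokens, e.g. "0:1.0 7:2.5". */
    public static Map<Integer, Double> parse(String line) {
        Map<Integer, Double> vector = new TreeMap<>();
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return vector; // an empty record becomes an empty vector
        }
        for (String token : trimmed.split("\\s+")) {
            int sep = token.indexOf(':');
            if (sep <= 0 || sep == token.length() - 1) {
                throw new IllegalArgumentException("Expected index:value but got: " + token);
            }
            int index = Integer.parseInt(token.substring(0, sep));
            double value = Double.parseDouble(token.substring(sep + 1));
            vector.put(index, value);
        }
        return vector;
    }

    public static void main(String[] args) {
        // Entries are printed in index order because TreeMap keeps keys sorted.
        System.out.println(parse("0:1.0 7:2.5 3:4"));
    }
}
```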

Would it be OK if I attached code showing an example run of min-hash clustering in
the examples directory? (It would first convert the dataset format accordingly.)

Go for it. Code is good, patches are even better; see:
and simply attach it to this issue. 

> Minhash based clustering 
> -------------------------
>                 Key: MAHOUT-344
>                 URL:
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.3
>            Reporter: Ankur
>            Assignee: Ankur
>         Attachments: MAHOUT-344-v1.patch
> Minhash clustering performs probabilistic dimension reduction of high-dimensional data.
> The essence of the technique is to hash each item using multiple independent hash functions
> so that similar items are more likely to collide than dissimilar ones. Multiple such hash tables
> can then be constructed to answer near-neighbor queries efficiently.
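
The idea described above can be sketched in a few lines of plain Java. This is only an illustration under stated assumptions, not the code in the attached patch: the class name MinHashSketch, the linear hash family h(x) = (a*x + b) mod p, and the seed are all chosen for this example. Each item is a set of integer feature ids, its signature is the minimum hash value under each function, and the fraction of matching signature positions estimates the Jaccard similarity of two sets.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

// Hypothetical illustration of min-hashing; not taken from MAHOUT-344-v1.patch.
public class MinHashSketch {
    // Modulus for the hash family h(x) = (a*x + b) mod p, with p = 2^31 - 1 (prime).
    private static final long PRIME = 2147483647L;

    private final long[] a;
    private final long[] b;

    public MinHashSketch(int numHashFunctions, long seed) {
        Random rng = new Random(seed);
        a = new long[numHashFunctions];
        b = new long[numHashFunctions];
        for (int i = 0; i < numHashFunctions; i++) {
            a[i] = 1 + rng.nextInt(Integer.MAX_VALUE - 1); // in [1, p-1]
            b[i] = rng.nextInt(Integer.MAX_VALUE);          // in [0, p-1]
        }
    }

    /** Returns one minimum hash value per hash function for the given set. */
    public long[] signature(Set<Integer> items) {
        long[] sig = new long[a.length];
        Arrays.fill(sig, Long.MAX_VALUE);
        for (int item : items) {
            long x = Math.floorMod((long) item, PRIME); // map ids into [0, p-1]
            for (int i = 0; i < a.length; i++) {
                long h = (a[i] * x + b[i]) % PRIME;
                if (h < sig[i]) {
                    sig[i] = h;
                }
            }
        }
        return sig;
    }

    /** Fraction of positions where two signatures agree (a Jaccard estimate). */
    public static double estimateSimilarity(long[] s1, long[] s2) {
        int matches = 0;
        for (int i = 0; i < s1.length; i++) {
            if (s1[i] == s2[i]) {
                matches++;
            }
        }
        return (double) matches / s1.length;
    }

    public static void main(String[] args) {
        MinHashSketch mh = new MinHashSketch(128, 42L);
        Set<Integer> x = new HashSet<>(Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8));
        Set<Integer> y = new HashSet<>(Arrays.asList(1, 2, 3, 4, 5, 6, 7, 9));
        Set<Integer> z = new HashSet<>(Arrays.asList(100, 200, 300));
        // True Jaccard similarity of x and y is 7/9; x and z share nothing.
        System.out.println("x vs y: " + estimateSimilarity(mh.signature(x), mh.signature(y)));
        System.out.println("x vs z: " + estimateSimilarity(mh.signature(x), mh.signature(z)));
    }
}
```

For clustering, a common approach is to concatenate a few signature values into a key and group items that share the same key, so that similar items tend to land in the same bucket. The linear hash family used here is only approximately min-wise independent, which is usually acceptable for a sketch like this.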

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
