mahout-user mailing list archives

From Grant Ingersoll <>
Subject Re: Practical Advice on Clustering
Date Fri, 17 Oct 2008 18:42:33 GMT

On Oct 13, 2008, at 12:17 AM, Vaijanath N. Rao wrote:

> Hi Grant,
> My replies are inline.
> Grant Ingersoll wrote:
>> I'm looking into adding document clustering capabilities to Solr,  
>> using Mahout [1][2].  I already have search-results clustering,  
>> thanks to Carrot2.  What I'm looking for is practical advice on  
>> deploying a system that is going to cluster potentially large  
>> corpora (but not huge, and let's assume one machine for now, but it  
>> shouldn't matter)
>> Here are some thoughts I have:
>> In Solr, I expect to send a request to go off and build the  
>> clusters for some non-trivial set of documents in the index.  The  
>> actual building needs to happen in a background thread, so as to  
>> not hold up the caller.
> Bingo! It's better to spawn a new process for clustering rather than
> hold up the caller. It would also help to have a status page for the
> clustering job, since the caller can then check that page to find out
> the current status.
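A rough sketch of that flow, for concreteness (illustrative names only, not an actual Solr or Mahout API; the batch loop is a placeholder for the real clustering work):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: a clustering request handler that spawns the work
// on a background thread and exposes a percent-complete status that the
// caller (or a status page) can poll.
public class ClusterJobRunner {
    public enum State { PENDING, RUNNING, DONE }

    private final ExecutorService pool = Executors.newSingleThreadExecutor();
    private final AtomicInteger percentComplete = new AtomicInteger(0);
    private volatile State state = State.PENDING;
    private volatile Future<?> job;

    // Returns immediately; the clustering itself runs in the background.
    public void submit(final int totalDocs) {
        state = State.RUNNING;
        job = pool.submit(() -> {
            for (int done = 1; done <= totalDocs; done++) {
                // ... cluster the next batch of documents here (omitted) ...
                percentComplete.set(done * 100 / totalDocs);
            }
            state = State.DONE;
        });
    }

    // What the status page would report back to the caller.
    public String status() {
        return state + " " + percentComplete.get() + "%";
    }

    // Block until the job finishes (for testing/demo only; a real caller
    // would poll status() instead of blocking).
    public void awaitCompletion() {
        try {
            job.get();
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```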
>> My thinking is the request will come in and spawn off a job that  
>> goes and calculates a similarity matrix for all the documents in  
>> the set (need to store the term vectors in Lucene) and then goes  
>> and runs the clustering job (user configurable, based on the  
>> implementations we have: k-means, mean-shift, fuzzy, whatever) and  
>> stores the results into Solr's data directory somehow (so that it  
>> can be replicated, but not a big concern of mine at the moment)
> If we are going to work on a similarity matrix, I would like to add
> FIHC (Frequent Itemset-based Hierarchical Clustering). If you need, I
> can definitely pitch in with this. Ideally we should target
> replication, and I think the idea is good.

I'm open to anything.  I figured we'd start with the simplest, but if  
you have references, that would be cool.
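For reference, the pairwise step I'm picturing is just plain cosine similarity over the stored term vectors. A minimal sketch, where the Map representation is a stand-in for Lucene's term vectors (not the actual Lucene API):

```java
import java.util.Map;

// Cosine similarity between two term-weight vectors, represented here as
// term -> weight maps (an assumption for illustration; in practice these
// would come from the term vectors stored in the Lucene index).
public class CosineSimilarity {
    public static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double bv = b.get(e.getKey());
            if (bv != null) dot += e.getValue() * bv;  // shared terms only
            normA += e.getValue() * e.getValue();
        }
        for (double v : b.values()) normB += v * v;
        if (normA == 0 || normB == 0) return 0.0;      // empty vector guard
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

The O(n^2) matrix build is exactly the part that wants to run as a background (or Hadoop) job for any non-trivial document set.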

>> Then, at any time, the application can ask Solr for the clusters  
>> (whatever that means) and it will return them (docids, fields,  
>> whatever the app asks for).  If the background task isn't done yet,  
>> the results set will be empty, or it will return a percentage  
>> completion or something useful.
> In my opinion it is better to return the percentage of completion
> rather than the top clusters at time X if the clustering is not yet
> finished. In most clustering cases the input data determines the
> centroids of the clusters, so a change in input might change the
> centroids, and you might get different results for different input
> samples drawn from the same data set.

Yeah, I think percent complete is good; it will also keep the amount  
of traffic down.  But, in true Solr fashion, maybe sending partial  
clusters can be optional, too.
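Something like this shape for the response, where the partial clusters only come back when the caller opts in (all names are made up for illustration, not a real Solr response format):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of a status response: always report percent complete, and only
// include the clusters formed so far when the caller asks for them,
// mirroring Solr's style of optional request parameters.
public class ClusterStatusResponse {
    public static Map<String, Object> build(int percentComplete,
                                            List<List<Integer>> partialClusters,
                                            boolean includePartial) {
        Map<String, Object> rsp = new LinkedHashMap<>();
        rsp.put("percentComplete", percentComplete);
        // Once the job is done, the normal cluster-retrieval path applies,
        // so partial results are only meaningful mid-run.
        if (includePartial && percentComplete < 100) {
            rsp.put("partialClusters", partialClusters);
        }
        return rsp;
    }
}
```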

>> Obviously, my first step is to get it working, but...
>> Is it practical to return a partially done set of results?  i.e.  
>> the best clusters so far, with perhaps a percent-complete value, or  
>> perhaps a list of the comparisons that haven't been done yet?
>> What if something happens?  How can I make Mahout fault-tolerant,  
>> such that, conceivably, I could pick up the job again from where it  
>> went down, or at least get the clusters formed so far?  How do  
>> people approach this to date (w/ or w/o Mahout)?  What needs to be  
>> done in Mahout to make this possible?  I suspect Hadoop has some  
>> support for it.
> Not sure whether Mahout is fault tolerant in that respect, but I
> guess other members can comment on this.

No, I don't think it is at the moment in this respect.  I mean, Hadoop  
has it, so it probably isn't that hard to add...
