mahout-user mailing list archives

From Jeff Eastman <j...@windwardsolutions.com>
Subject Re: Incremental clustering - Kmeans + Canopy
Date Fri, 21 Jan 2011 03:45:37 GMT
Ted, can you add a little more about hashed vectorization? Would that 
come from hashing the terms to determine their dictionary indices? I can 
see how hashing terms into a large-dimensional space could produce 
constant-sized vectors, though there might be problems if multiple terms 
hash to the same index. Is this something we could consider for 
seq2sparse to make online clustering processes work more smoothly?
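
For concreteness, here is roughly what I imagine, as an untested sketch 
(class and package names are from memory of the encoder work and may not 
match exactly):

import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

// Hash every term into a fixed-cardinality vector; the cardinality stays
// constant no matter how many new terms later articles introduce.
Vector encode(Iterable<String> terms) {
    StaticWordValueEncoder encoder = new StaticWordValueEncoder("text");
    encoder.setProbes(2); // hash each term to two slots to soften collisions
    Vector vector = new RandomAccessSparseVector(1 << 20); // 2^20 buckets
    for (String term : terms) {
        encoder.addToVector(term, vector);
    }
    return vector;
}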


On 1/20/11 12:08 PM, Ted Dunning wrote:
> On Thu, Jan 20, 2011 at 10:56 AM, Jeff Eastman
> <jdog@windwardsolutions.com>wrote:
>
>> Hi Veronica,
>>
>> I've only tried incremental clustering as a thought experiment, but the kind
>> of problem you are attacking has many areas of applicability. The problem
>> you are seeing is that new articles bring new terms with them, and this will
>> produce vectors of different cardinality as new articles are added. You can
>> trick the Vector implementation by creating all the vectors with maxInt
>> cardinality, but the current Mahout text vectorization (seq2sparse) does not
>> handle the growth in the dictionary which incremental additions would entail.
>> If we could prime seq2sparse with the dictionary from the last addition,
>> we might be able to support incremental vectorization with minimal changes.
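>>
>> Something like this untested sketch could load the dictionary from the
>> last run so a later pass reuses the old term indices (I'm assuming here
>> that the dictionary.file-* output of seq2sparse is a SequenceFile of
>> Text/IntWritable pairs):
>>
>> import java.io.IOException;
>> import java.util.HashMap;
>> import java.util.Map;
>> import org.apache.hadoop.conf.Configuration;
>> import org.apache.hadoop.fs.FileSystem;
>> import org.apache.hadoop.fs.Path;
>> import org.apache.hadoop.io.IntWritable;
>> import org.apache.hadoop.io.SequenceFile;
>> import org.apache.hadoop.io.Text;
>>
>> // Read the term -> index dictionary from the previous vectorization so
>> // new terms can be appended after the highest existing index instead of
>> // restarting the numbering (which is what breaks cardinality today).
>> Map<String, Integer> loadDictionary(Configuration conf, Path dictFile)
>>     throws IOException {
>>   Map<String, Integer> dictionary = new HashMap<String, Integer>();
>>   FileSystem fs = FileSystem.get(conf);
>>   SequenceFile.Reader reader = new SequenceFile.Reader(fs, dictFile, conf);
>>   Text term = new Text();
>>   IntWritable index = new IntWritable();
>>   while (reader.next(term, index)) {
>>     dictionary.put(term.toString(), index.get());
>>   }
>>   reader.close();
>>   return dictionary;
>> }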
>>
> Jeff, using hashed vectorization would solve this as well, because the
> document vectors will always have constant size.  Commonly used distances
> should work unchanged with a hashed representation, although you might have
> a few scaling surprises with multiple probes.
>
>
>> I don't completely agree with MIA 11.3.1's "use canopy clustering" phrase;
>> I think it is a bit misleading. Each of the clustering algorithms (including
>> canopy) has two phases: cluster generation and vector classification using
>> those clusters. I think the best choice for a maximum likelihood classifier
>> would actually be KMeansDriver.clusterData() and not the CanopyDriver
>> version (which requires t1 and t2 values to initialize the clusterer but
>> these are never used for classification).
>>
>> Really implementing the case study would seem to require a single-threshold
>> classification to avoid assigning new articles to existing clusters that are
>> too dissimilar to be a good fit. The leftovers could then be used to generate
>> new clusters, which could be added to the list.
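>>
>> In code, the single-threshold step might look something like this
>> (sequential, untested, using the same measure and clusters as above;
>> 'threshold', 'newArticles', and the assignments map are placeholders
>> I'm inventing for illustration):
>>
>> Map<Vector, Cluster> assignments = new HashMap<Vector, Cluster>();
>> List<Vector> leftovers = new ArrayList<Vector>();
>> for (Vector article : newArticles) {
>>     // find the closest existing cluster center
>>     Cluster closest = null;
>>     double closestDistance = Double.MAX_VALUE;
>>     for (Cluster cluster : clusters) {
>>         double d = measure.distance(cluster.getCenter(), article);
>>         if (d < closestDistance) {
>>             closestDistance = d;
>>             closest = cluster;
>>         }
>>     }
>>     if (closestDistance <= threshold) {
>>         assignments.put(article, closest); // close enough: classify it
>>     } else {
>>         leftovers.add(article); // too dissimilar: seed for new clusters
>>     }
>> }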
>>
>> Perhaps one of the authors can add some clarification on this too?
>>
>> Jeff
>>
>> On 1/20/11 8:24 AM, Veronica Joh wrote:
>>
>>> Hi
>>> I have a large number of articles clustered by kmeans.
>>> For the new articles that come in, the Mahout in Action book says to "use
>>> canopy clustering to assign it to the cluster whose centroid is closest
>>> based on a very small distance threshold".
>>> I'm not sure how to add new article canopies to the existing clusters.
>>>
>>> So I'm saving batch articles in a list of Cluster like this.
>>> List<Cluster>   clusters = new ArrayList<Cluster>();
>>>
>>> For the new article canopies, I'm trying the following to measure the
>>> distance, but I get an error: "Required cardinality 11981 but got
>>> 77372"
>>> Text key = new Text();
>>> Canopy value = new Canopy();
>>> DistanceMeasure measure = new EuclideanDistanceMeasure();
>>> while (reader.next(key, value)) {
>>>     // distance from each existing centroid to the new canopy's center
>>>     for (int i = 0; i < clusters.size(); i++) {
>>>         double d = measure.distance(clusters.get(i).getCenter(),
>>>             value.getCenter());
>>>     }
>>> }
>>>
>>> Is this how to compare cluster centroids with new canopies, or did I
>>> misunderstand something?
>>> Can you please help me so I can get the online news clustering working?
>>> Thank you very much!
>>>
>>

