mahout-user mailing list archives

From Jeff Eastman <j...@windwardsolutions.com>
Subject Re: Problem using SNAPSHOT kmeans
Date Wed, 06 Jun 2012 14:48:33 GMT
I was able to easily duplicate this exception by creating a Kluster with 
a zero center and requesting the pdf of a zero vector. This invokes 
CosineDistanceMeasure.distance() with two empty vectors, creating a 
corner case where the dotProduct and denominator are both zero. Thus the 
distance is NaN and this propagates to the probabilities vector as {NaN, 
NaN, ... NaN}  and the out of bounds exception in select() that you've 
observed.
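
To make the propagation concrete, here is a minimal Java sketch of the chain described above. It is not Mahout's code: the class and method names are hypothetical stand-ins, the pdf formula 1 / (1 + distance) is an assumption chosen to agree with the values quoted later in this thread (pdf of 1 at distance 0, 0.5 at distance 1), and the max-index search is assumed to start from -1 and compare with '>'.

```java
import java.util.Arrays;

final class NaNPropagationSketch {
    // Unguarded cosine distance over dense arrays of equal length
    // (hypothetical stand-in for CosineDistanceMeasure.distance()).
    static double cosineDistance(double[] a, double[] b) {
        double dotProduct = 0.0;
        double lengthSquaredA = 0.0;
        double lengthSquaredB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dotProduct += a[i] * b[i];
            lengthSquaredA += a[i] * a[i];
            lengthSquaredB += b[i] * b[i];
        }
        double denominator = Math.sqrt(lengthSquaredA) * Math.sqrt(lengthSquaredB);
        return 1.0 - dotProduct / denominator; // 0.0 / 0.0 evaluates to NaN
    }

    // Assumed pdf shape: 1 at distance 0, 0.5 at distance 1, NaN for NaN.
    static double pdf(double distance) {
        return 1.0 / (1.0 + distance);
    }

    // Assumed max-index search: starts at -1 and only moves on a strict '>'.
    static int maxValueIndex(double[] values) {
        int result = -1;
        double max = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < values.length; i++) {
            if (values[i] > max) { // every comparison with NaN is false
                max = values[i];
                result = i;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        double[] zero = new double[3];          // empty (all-zero) input vector
        double[] probabilities = new double[4]; // one entry per cluster
        for (int i = 0; i < probabilities.length; i++) {
            probabilities[i] = pdf(cosineDistance(zero, zero));
        }
        System.out.println(Arrays.toString(probabilities));
        System.out.println(maxValueIndex(probabilities));
    }
}
```

Under these assumptions every probability is NaN and the index search never leaves its initial -1, which matches the index reported in the exception.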

The operative line in CosineDistanceMeasure is:

      return 1.0 - dotProduct / denominator;

... and the problem arises when both dotProduct and denominator are 
zero, since 0/0 evaluates to NaN. It seems unreasonable for k-means to 
fail to cluster zero vectors. In this case the distance ought to be 1.
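
A guarded version of that line might look like the following sketch. It is a standalone illustration over plain arrays, not a patch to CosineDistanceMeasure: the class name and signature are hypothetical, and returning 1 for a zero denominator is only the proposal made here, not settled behavior.

```java
final class GuardedCosineDistanceSketch {
    // Cosine distance over dense arrays of equal length, with a guard for
    // the zero-denominator corner case (hypothetical helper, not Mahout API).
    static double distance(double[] a, double[] b) {
        double dotProduct = 0.0;
        double lengthSquaredA = 0.0;
        double lengthSquaredB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dotProduct += a[i] * b[i];
            lengthSquaredA += a[i] * a[i];
            lengthSquaredB += b[i] * b[i];
        }
        double denominator = Math.sqrt(lengthSquaredA) * Math.sqrt(lengthSquaredB);
        if (denominator == 0.0) {
            // At least one vector is all zeros, so dotProduct is zero too.
            // Return the proposed value instead of evaluating 0.0 / 0.0.
            return 1.0;
        }
        return 1.0 - dotProduct / denominator;
    }

    public static void main(String[] args) {
        double[] zero = new double[3];
        double[] v = {1.0, 2.0, 3.0};
        System.out.println(distance(zero, zero)); // guarded case
        System.out.println(distance(zero, v));    // also hits the guard
        System.out.println(distance(v, v));       // close to 0.0
    }
}
```

Note that the guard also covers a zero vector compared with a non-zero one, since the denominator is zero there as well.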

What do others think?


On 6/6/12 9:53 AM, Jeff Eastman wrote:
> Yes, it looks like the input vectors are empty and this is the source 
> of the error. I'm troubled, however, that empty vectors can have this 
> impact on k-means. I'm going to write a unit test to see if I can 
> duplicate this exception.
>
> On 6/5/12 3:12 PM, Pat Ferrel wrote:
>> I think I found the root but not sure what needs fixing.
>>
>> I took out n-gram generation and the vector now looks like this:
>> Key: https://farfetchers.com/category/collections/source/brice-berard:
>> Value: 
>> https://farfetchers.com/category/collections/source/brice-berard:{701:0.5484552974788475,1876:0.6020428878306935,3620:0.5802940184767269}
>>
>> This works in clustering.
>>
>> It doesn't seem like a malformed vector should crash clustering (it 
>> apparently doesn't in mahout 0.6) but it looks like something in 
>> seq2sparse's n-gram weighting does cause a malformed vector.
>>
>> I'll file a JIRA
>>
>> On 6/5/12 11:48 AM, Pat Ferrel wrote:
>>> Using seqdumper on the TFIDF vectors, that vector is indeed in the list
>>> Key: https://farfetchers.com/category/collections/source/brice-berard:
>>> Value: 
>>> https://farfetchers.com/category/collections/source/brice-berard:{
>>>
>>> Looking in the seqfiles we find the document in part-00005 of 10 in 
>>> no particular part of the file.
>>> Key: https://farfetchers.com/category/collections/source/brice-berard:
>>> Value: ::Title::
>>> Brice Berard | FarFetchers.com
>>> Blog Posts
>>>
>>> On the chance that this originates in seq2sparse I'll try changing 
>>> options until the vector looks different, and then try clustering again.
>>>
>>> On 6/5/12 10:43 AM, Pat Ferrel wrote:
>>>> I'm not completely sure what I'm looking at but...
>>>>
>>>> In iterateSeq on iteration #1 of processing vectors/tfidf-vectors 
>>>> it reads
>>>> vector = 
>>>> "https://farfetchers.com/category/collections/source/brice-berard:{"
>>>>
>>>> It's a named vector where the URL is the name and the value is "{", 
>>>> which looks wrong. When that is classified to get a probability 
>>>> it gets
>>>>
>>>> probabilities = 
>>>> "{0:NaN,1:NaN,2:NaN,3:NaN,4:NaN,5:NaN,6:NaN,7:NaN,8:NaN,9:NaN,10:NaN,11:NaN,12:NaN,13:NaN,14:NaN,15:NaN,16:NaN,17:NaN,18:NaN,19:NaN}"
>>>>
>>>> That causes the probabilities.maxValueIndex() = -1 and everything 
>>>> dies.
>>>>
>>>> vector looks wrong, doesn't it? Truncated?
>>>>
>>>> I went back to try the same on mahout 0.6 but iterateSeq does not 
>>>> get called though I used -xm sequential on both runs. I can't see 
>>>> kmeans-clusters/clusters-0 being created on mahout 0.6 either. Is 
>>>> that part of the refactoring?
>>>>
>>>> On 6/4/12 3:07 PM, Pat Ferrel wrote:
>>>>> Some things to try:
>>>>> - Have you verified the contents of your input vectors actually 
>>>>> have data in them?
>>>>> * YES, from the other email you know that the data works fine in 0.6
>>>>> - Can you run the cluster dumper on the 
>>>>> b3/kmeans-clusters/clusters-0 contents?
>>>>> * YES, It is attached from trunk's clusterdump after the failure 
>>>>> of kmeans, of course. A simple data set fortunately.
>>>>> - Is it possible to run the sequential version (-xm sequential)? 
>>>>> If it is you could run it in a debugger to gain more insight.
>>>>> * YES, will report back.
>>>>>
>>>>> On 6/4/12 2:19 PM, Jeff Eastman wrote:
>>>>>> It looks like the probabilities vector returned by 
>>>>>> AbstractClusteringPolicy.classify() has no non-zero elements. In
>>>>>> this case, AbstractClusteringPolicy.select()'s call to 
>>>>>> AbstractVector.maxValueIndex() is returning -1 and that is 
>>>>>> causing the exception.
>>>>>>
>>>>>> How could this happen? I'm not exactly sure, but consider that 
>>>>>> the probabilities vector is calculated in 
>>>>>> AbstractClusteringPolicy.classify() by calling 
>>>>>> DistanceMeasureCluster.pdf() on each of the prior clusters in 
>>>>>> b3/kmeans-clusters/clusters-0. With a CosineDistanceMeasure I 
>>>>>> don't see how this could ever return zero. Certainly, some of 
>>>>>> your vectors will match the prior cluster centers exactly (they 
>>>>>> were sampled from the input) and those values would return 
>>>>>> pdf==1. Even if the cosine distance was 1 the pdf would be 0.5.
>>>>>>
>>>>>> Some things to try:
>>>>>> - Have you verified the contents of your input vectors actually 
>>>>>> have data in them?
>>>>>> - Can you run the cluster dumper on the 
>>>>>> b3/kmeans-clusters/clusters-0 contents?
>>>>>> - Is it possible to run the sequential version (-xm sequential)?
>>>>>> If it is you could run it in a debugger to gain more insight.
>>>>>>
>>>>>> Jeff
>>>>>>
>>>>>> On 6/4/12 12:05 PM, Pat Ferrel wrote:
>>>>>>> Using the CLI to run kmeans from several trunk versions I get an 
>>>>>>> error I don't understand.  When the job died the 
>>>>>>> b3/canopy-centroids/clusters-0-final contained the random-seeds 
>>>>>>> file generated by the kmeans driver and the 
>>>>>>> b3/kmeans-clusters/clusters-0 had several part files but 
>>>>>>> b3/kmeans-clusters/clusters-1 was empty. When I look through the 
>>>>>>> code from the trace it doesn't make much sense.
>>>>>>>
>>>>>>> Command line:
>>>>>>> mahout kmeans
>>>>>>>   -i b3/vectors/tfidf-vectors/
>>>>>>>   -k 20
>>>>>>>   -c b3/canopy-centroids/clusters-0-final
>>>>>>>   -cl
>>>>>>>   -o b3/kmeans-clusters
>>>>>>>   -ow
>>>>>>>   -cd 0.01
>>>>>>>   -x 30
>>>>>>>   -dm org.apache.mahout.common.distance.CosineDistanceMeasure
>>>>>>>
>>>>>>> Error:
>>>>>>> 12/06/04 07:55:03 INFO common.AbstractJob: Command line 
>>>>>>> arguments: {--clustering=null, 
>>>>>>> --clusters=[b3/canopy-centroids/clusters-0-final], 
>>>>>>> --convergenceDelta=[0.01], 
>>>>>>> --distanceMeasure=[org.apache.mahout.common.distance.CosineDistanceMeasure],
>>>>>>> --endPhase=[2147483647], --input=[b3/vectors/tfidf-vectors/],
>>>>>>> --maxIter=[30], --method=[mapreduce], --numClusters=[20], 
>>>>>>> --output=[b3/kmeans-clusters], --overwrite=null, 
>>>>>>> --startPhase=[0], --tempDir=[temp]}
>>>>>>> 2012-06-04 07:55:03.752 java[67308:1903] Unable to load realm
>>>>>>> info from SCDynamicStore
>>>>>>> 12/06/04 07:55:03 INFO common.HadoopUtil: Deleting 
>>>>>>> b3/canopy-centroids/clusters-0-final
>>>>>>> 12/06/04 07:55:04 WARN util.NativeCodeLoader: Unable to load
>>>>>>> native-hadoop library for your platform... using builtin-java
>>>>>>> classes where applicable
>>>>>>> 12/06/04 07:55:04 INFO compress.CodecPool: Got brand-new compressor
>>>>>>> 12/06/04 07:55:04 INFO kmeans.RandomSeedGenerator: Wrote 20 
>>>>>>> vectors to b3/canopy-centroids/clusters-0-final/part-randomSeed
>>>>>>> 12/06/04 07:55:04 INFO kmeans.KMeansDriver: Input: 
>>>>>>> b3/vectors/tfidf-vectors Clusters In: 
>>>>>>> b3/canopy-centroids/clusters-0-final/part-randomSeed Out: 
>>>>>>> b3/kmeans-clusters Distance: 
>>>>>>> org.apache.mahout.common.distance.CosineDistanceMeasure
>>>>>>> 12/06/04 07:55:04 INFO kmeans.KMeansDriver: convergence: 0.01
>>>>>>> max Iterations: 30 num Reduce Tasks: 
>>>>>>> org.apache.mahout.math.VectorWritable Input Vectors: {}
>>>>>>> 12/06/04 07:55:04 INFO compress.CodecPool: Got brand-new 
>>>>>>> decompressor
>>>>>>> Cluster Iterator running iteration 1 over priorPath: 
>>>>>>> b3/kmeans-clusters/clusters-0
>>>>>>> 12/06/04 07:55:05 INFO input.FileInputFormat: Total input paths
>>>>>>> to process : 1
>>>>>>> 12/06/04 07:55:05 INFO mapred.JobClient: Running job: 
>>>>>>> job_local_0001
>>>>>>> 12/06/04 07:55:06 INFO mapred.MapTask: io.sort.mb = 100
>>>>>>> 12/06/04 07:55:08 INFO mapred.MapTask: data buffer = 
>>>>>>> 79691776/99614720
>>>>>>> 12/06/04 07:55:08 INFO mapred.MapTask: record buffer = 
>>>>>>> 262144/327680
>>>>>>> 12/06/04 07:55:08 INFO mapred.JobClient:  map 0% reduce 0%
>>>>>>> 12/06/04 07:55:09 WARN mapred.LocalJobRunner: job_local_0001
>>>>>>> org.apache.mahout.math.IndexException: Index -1 is outside 
>>>>>>> allowable range of [0,20)
>>>>>>>     at 
>>>>>>> org.apache.mahout.math.AbstractVector.set(AbstractVector.java:439)
>>>>>>>     at 
>>>>>>> org.apache.mahout.clustering.iterator.AbstractClusteringPolicy.select(AbstractClusteringPolicy.java:44)
>>>>>>>     at 
>>>>>>> org.apache.mahout.clustering.iterator.CIMapper.map(CIMapper.java:52)
>>>>>>>     at 
>>>>>>> org.apache.mahout.clustering.iterator.CIMapper.map(CIMapper.java:18)
>>>>>>>     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>>>>>>>     at 
>>>>>>> org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>>>>>>>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>>>>>>>     at 
>>>>>>> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
>>>>>>> 12/06/04 07:55:09 INFO mapred.JobClient: Job complete: 
>>>>>>> job_local_0001
>>>>>>> 12/06/04 07:55:09 INFO mapred.JobClient: Counters: 0
>>>>>>> Exception in thread "main" java.lang.InterruptedException: 
>>>>>>> Cluster Iteration 1 failed processing b3/kmeans-clusters/clusters-1
>>>>>>>     at 
>>>>>>> org.apache.mahout.clustering.iterator.ClusterIterator.iterateMR(ClusterIterator.java:186)
>>>>>>>     at 
>>>>>>> org.apache.mahout.clustering.kmeans.KMeansDriver.buildClusters(KMeansDriver.java:229)
>>>>>>>     at 
>>>>>>> org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:149)
>>>>>>>     at 
>>>>>>> org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:108)
>>>>>>>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>>>>>     at 
>>>>>>> org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:49)
>>>>>>>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>>>     at 
>>>>>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>>>>>>     at 
>>>>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>>>>     at java.lang.reflect.Method.invoke(Method.java:597)
>>>>>>>     at 
>>>>>>> org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>>>>>>>     at 
>>>>>>> org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>>>>>>>     at 
>>>>>>> org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)

