mahout-user mailing list archives

From Jeff Eastman <j...@windwardsolutions.com>
Subject Re: Clustering from DB
Date Mon, 13 Jul 2009 16:41:21 GMT
r793620 fixes the KMeansDriver.isConverged() method to iterate over all 
cluster part files. Unit test now runs without error and the synthetic 
control job completes too.
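
For readers following the thread, here is a minimal sketch of the "check every part file" idea. It is not the code committed in r793620. It uses only the local filesystem through java.nio rather than Hadoop's FileSystem and SequenceFile.Reader, and the class name PartFileConvergence, the helper partConverged, and the tab-separated "<clusterId><TAB><true|false>" line format are all assumptions made for illustration.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class PartFileConvergence {

    // Hypothetical record format: one cluster per line, "<clusterId>\t<true|false>".
    // Returns true only if every cluster listed in this part file has converged.
    static boolean partConverged(Path part) throws IOException {
        for (String line : Files.readAllLines(part)) {
            if (line.isEmpty()) {
                continue;
            }
            String[] fields = line.split("\t");
            if (fields.length < 2 || !Boolean.parseBoolean(fields[1])) {
                return false;
            }
        }
        return true;
    }

    // Loops over all part-* files in the clusters directory instead of
    // opening a single hard-coded part file or a glob string as a path.
    // Returns false if any cluster has not converged or no part file exists.
    static boolean isConverged(Path clustersDir) throws IOException {
        boolean sawPart = false;
        try (DirectoryStream<Path> parts = Files.newDirectoryStream(clustersDir, "part-*")) {
            for (Path part : parts) {
                sawPart = true;
                if (!partConverged(part)) {
                    return false;
                }
            }
        }
        return sawPart;
    }

    public static void main(String[] args) throws IOException {
        // Simulate the output of two reducers for one iteration.
        Path dir = Files.createTempDirectory("clusters-0");
        Files.write(dir.resolve("part-00000"), List.of("c0\ttrue", "c1\ttrue"));
        Files.write(dir.resolve("part-00001"), List.of("c2\tfalse"));
        System.out.println(isConverged(dir));
    }
}
```

The caller passes the clusters directory itself (for example output/clusters-0), not a path ending in /part-00000, so the result does not depend on how many reducers wrote output.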


Jeff Eastman wrote:
> In this case, the code should be reading all of the clusters into 
> memory to see if they have all converged. These may be split into 
> multiple part files if more than one reducer is specified. So /* is 
> the correct file pattern and it is the calling site that should remove 
> the /part-0000 reference. The code in isConverged should loop through 
> all the parts, returning if they have all converged or not.
>
> I'll take a detailed look tomorrow.
>
>
> Grant Ingersoll wrote:
>> Hmm, that might be a mistake on my part when trying to resolve how 
>> Hadoop 0.20 now resolves globs.  I somewhat blindly applied "/*" 
>> where needed, but I think it is likely worth revisiting here where a 
>> specific file is needed?
>>
>> -Grant
>>
>> On Jul 10, 2009, at 3:08 PM, nfantone wrote:
>>
>>> This error is still bugging me. The exception:
>>>
>>> WARNING: java.io.FileNotFoundException: File
>>> output/clusters-0/part-00000/* does not exist.
>>> java.io.FileNotFoundException: File output/clusters-0/part-00000/*
>>> does not exist.
>>>
>>> occurs first at:
>>>
>>> org.apache.mahout.clustering.kmeans.KMeansDriver.isConverged(KMeansDriver.java:298)
>>>
>>> which corresponds to:
>>>
>>>  private static boolean isConverged(String filePath, JobConf conf,
>>> FileSystem fs)
>>>      throws IOException {
>>>    Path outPart = new Path(filePath + "/*");
>>>    SequenceFile.Reader reader = new SequenceFile.Reader(fs, outPart,
>>> conf);  <-- THIS
>>>    ...
>>>  }
>>>
>>> where isConverged() is called in this fashion:
>>>
>>> return isConverged(clustersOut + "/part-00000", conf, fs);
>>>
>>> by runIteration(), which is previously invoked by runJob() like:
>>>
>>>     String clustersOut = output + "/clusters-" + iteration;
>>>      converged = runIteration(input, clustersIn, clustersOut, 
>>> measureClass,
>>>          delta, numReduceTasks, iteration);
>>>
>>> Consequently, assuming it's the first iteration and the output folder
>>> has been named "output" by the user, the SequenceFile.Reader receives
>>> "output/clusters-0/part-00000/*" as a path, which is non-existent. I
>>> believe the path should end in "part-00000" and the  + "/*" should be
>>> removed... although someone, evidently, thought otherwise.
>>>
>>> Any feedback?
>>>
>>> On Mon, Jul 6, 2009 at 5:39 PM, nfantone<nfantone@gmail.com> wrote:
>>>> I was using Canopy to create input clusters, but the error appeared
>>>> while running kMeans (if I run kMeans' job only with previously
>>>> created clusters from Canopy placed in output/canopies as initial
>>>> clusters, it still fails). I noticed no other problems. I was using
>>>> revision 790979 before updating.  Strangely, there were no changes in
>>>> the job and drivers class from that revision. svn diff shows that the
>>>> only classes that changed in org.apache.mahout.clustering.kmeans
>>>> package were KMeansInfo.java and RandomSeedGenerator.java
>>>>
>>>> On Mon, Jul 6, 2009 at 3:55 PM, Jeff Eastman<jdog@windwardsolutions.com> wrote:
>>>>> Hum, no, it's looking for the output of the first iteration. Were there
>>>>> other errors? What was the last revision you were running? It does look
>>>>> like something got horked, as it should be looking for
>>>>> output/clusters-0/*. Can you diff the job and driver class to see what
>>>>> changed?
>>>>>
>>>>> Jeff
>>>>>
>>>>> nfantone wrote:
>>>>>>
>>>>>> Fellows, today I updated to revision 791558 and while running kMeans
>>>>>> I got the following exception:
>>>>>>
>>>>>> WARNING: java.io.FileNotFoundException: File
>>>>>> output/clusters-0/part-00000/* does not exist.
>>>>>> java.io.FileNotFoundException: File output/clusters-0/part-00000/*
>>>>>> does not exist.
>>>>>>
>>>>>> The algorithm isn't interrupted, though. But this exception wasn't
>>>>>> thrown before the update and, to me, its message is not quite clear.
>>>>>> It seems as if it's looking for any file inside a "part-00000"
>>>>>> directory, which doesn't exist; and, as far as I know, "part-xxxxx"
>>>>>> are default names for output files.
>>>>>>
>>>>>> I could show the entire stack trace, if needed. Any pointers?
>>>>>>
>>>>>>
>>>>>> On Thu, Jul 2, 2009 at 3:16 PM, nfantone<nfantone@gmail.com> wrote:
>>>>>>
>>>>>>>
>>>>>>> Thanks for the feedback, Jeff.
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> The logical format of input to KMeans is <Key, Vector> as it is in
>>>>>>>> sequence file format, but the Key is never used. To my knowledge,
>>>>>>>> there is no requirement to assign identifiers to the input points*.
>>>>>>>> Users are free to associate an arbitrary name field with each
>>>>>>>> vector - also label mappings may be assigned - but these are not
>>>>>>>> manipulated by KMeans or any of the other clustering applications.
>>>>>>>> The name field is now used as a vector identifier by the
>>>>>>>> KMeansClusterMapper - if it is non-null - in the output step only.
>>>>>>>>
>>>>>>>
>>>>>>> The key may not be used internally, but externally they can prove
>>>>>>> to be pretty useful. For me, keys are userIDs and each Vector
>>>>>>> represents his/her historical behavior. Being able to collect the
>>>>>>> output information as <UserID, ClusterID> is quite neat as it allows
>>>>>>> me to, for instance, retrieve user information using data directly
>>>>>>> from a HDFS file's field.
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>
>> --------------------------
>> Grant Ingersoll
>> http://www.lucidimagination.com/
>>
>> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) 
>> using Solr/Lucene:
>> http://www.lucidimagination.com/search
>>
>>
>>
>
>
>

