mahout-user mailing list archives

From Jeff Eastman <j...@windwardsolutions.com>
Subject Re: Clustering from DB
Date Sun, 12 Jul 2009 22:51:46 GMT
In this case, the code should be reading all of the clusters into memory 
to see if they have all converged. These may be split into multiple part 
files if more than one reducer is specified. So /* is the correct file 
pattern, and it is the calling site that should remove the /part-00000 
reference. The code in isConverged should loop through all the parts 
and return true only if every cluster in them has converged.
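
As a rough sketch of the loop I have in mind (this is not the actual
KMeansDriver code): plain java.nio stands in for Hadoop's FileSystem and
SequenceFile.Reader, and the class name and the tab-separated
"<clusterId>\t<true|false>" record format are made up purely for
illustration. The point is that the caller passes the clusters directory,
the pattern is expanded into concrete part files, and each one is read in
turn.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

class ConvergenceCheck {

  // clustersDir is the iteration's output directory (e.g. output/clusters-0),
  // NOT a single part file. Hypothetical record format, one cluster per line:
  //   <clusterId>\t<true|false>
  static boolean isConverged(Path clustersDir) throws IOException {
    boolean sawPart = false;
    // Expand the pattern ourselves: a reader needs a concrete file, not a glob.
    try (DirectoryStream<Path> parts =
        Files.newDirectoryStream(clustersDir, "part-*")) {
      for (Path part : parts) {
        sawPart = true;
        for (String line : Files.readAllLines(part)) {
          if (line.isEmpty()) {
            continue;
          }
          String[] fields = line.split("\t");
          // A single unconverged (or malformed) record means another iteration.
          if (fields.length < 2 || !Boolean.parseBoolean(fields[1])) {
            return false;
          }
        }
      }
    }
    // Assumption: no part files at all is treated as "not converged".
    return sawPart;
  }
}
```

With this shape the calling site would pass clustersOut itself rather than
clustersOut + "/part-00000", so the number of reducers no longer matters.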

I'll take a detailed look tomorrow.


Grant Ingersoll wrote:
> Hmm, that might be a mistake on my part when trying to resolve how 
> Hadoop 0.20 now resolves globs.  I somewhat blindly applied "/*" where 
> needed, but I think it is likely worth revisiting here where a 
> specific file is needed?
>
> -Grant
>
> On Jul 10, 2009, at 3:08 PM, nfantone wrote:
>
>> This error is still bugging me. The exception:
>>
>> WARNING: java.io.FileNotFoundException: File
>> output/clusters-0/part-00000/* does not exist.
>> java.io.FileNotFoundException: File output/clusters-0/part-00000/*
>> does not exist.
>>
>> occurs first at:
>>
>> org.apache.mahout.clustering.kmeans.KMeansDriver.isConverged(KMeansDriver.java:298)
>>
>>
>> which corresponds to:
>>
>>  private static boolean isConverged(String filePath, JobConf conf,
>>      FileSystem fs) throws IOException {
>>    Path outPart = new Path(filePath + "/*");
>>    SequenceFile.Reader reader = new SequenceFile.Reader(fs, outPart, conf);  // <-- THIS
>>    ...
>>  }
>>
>> where isConverged() is called in this fashion:
>>
>> return isConverged(clustersOut + "/part-00000", conf, fs);
>>
>> by runIteration(), which is previously invoked by runJob() like:
>>
>>      String clustersOut = output + "/clusters-" + iteration;
>>      converged = runIteration(input, clustersIn, clustersOut, measureClass,
>>          delta, numReduceTasks, iteration);
>>
>> Consequently, assuming it's the first iteration and the output folder
>> has been named "output" by the user, the SequenceFile.Reader receives
>> "output/clusters-0/part-00000/*" as a path, which is non-existent. I
>> believe the path should end in "part-00000" and the  + "/*" should be
>> removed... although someone, evidently, thought otherwise.
>>
>> Any feedback?
>>
>> On Mon, Jul 6, 2009 at 5:39 PM, nfantone<nfantone@gmail.com> wrote:
>>> I was using Canopy to create input clusters, but the error appeared
>>> while running kMeans (if I run kMeans' job only with previously
>>> created clusters from Canopy placed in output/canopies as initial
>>> clusters, it still fails). I noticed no other problems. I was using
>>> revision 790979 before updating.  Strangely, there were no changes in
>>> the job and drivers class from that revision. svn diff shows that the
>>> only classes that changed in org.apache.mahout.clustering.kmeans
>>> package were KMeansInfo.java and RandomSeedGenerator.java
>>>
>>> On Mon, Jul 6, 2009 at 3:55 PM, Jeff Eastman<jdog@windwardsolutions.com> wrote:
>>>> Hum, no, it's looking for the output of the first iteration. Were there
>>>> other errors? What was the last revision you were running? It does look
>>>> like something got horked, as it should be looking for
>>>> output/clusters-0/*. Can you diff the job and driver class to see what
>>>> changed?
>>>>
>>>> Jeff
>>>>
>>>> nfantone wrote:
>>>>>
>>>>> Fellows, today I updated to revision 791558 and while running kMeans I
>>>>> got the following exception:
>>>>>
>>>>> WARNING: java.io.FileNotFoundException: File
>>>>> output/clusters-0/part-00000/* does not exist.
>>>>> java.io.FileNotFoundException: File output/clusters-0/part-00000/*
>>>>> does not exist.
>>>>>
>>>>> The algorithm isn't interrupted, though. But this exception wasn't
>>>>> thrown before the update and, to me, its message is not quite clear.
>>>>> It seems as if it's looking for any file inside a "part-00000"
>>>>> directory, which doesn't exist; and, as far as I know, "part-xxxxx" are
>>>>> default names for output files.
>>>>>
>>>>> I could show the entire stack trace, if needed. Any pointers?
>>>>>
>>>>>
>>>>> On Thu, Jul 2, 2009 at 3:16 PM, nfantone<nfantone@gmail.com> wrote:
>>>>>
>>>>>>
>>>>>> Thanks for the feedback, Jeff.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> The logical format of input to KMeans is <Key, Vector> as it is in
>>>>>>> sequence file format, but the Key is never used. To my knowledge, there
>>>>>>> is no requirement to assign identifiers to the input points*. Users are
>>>>>>> free to associate an arbitrary name field with each vector - also label
>>>>>>> mappings may be assigned - but these are not manipulated by KMeans or
>>>>>>> any of the other clustering applications. The name field is now used as
>>>>>>> a vector identifier by the KMeansClusterMapper - if it is non-null - in
>>>>>>> the output step only.
>>>>>>>
>>>>>>
>>>>>> The keys may not be used internally, but externally they can prove to
>>>>>> be pretty useful. For me, keys are userIDs and each Vector represents
>>>>>> his/her historical behavior. Being able to collect the output
>>>>>> information as <UserID, ClusterID> is quite neat as it allows me to,
>>>>>> for instance, retrieve user information using data directly from an
>>>>>> HDFS file's field.
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) 
> using Solr/Lucene:
> http://www.lucidimagination.com/search
>
>
>

