mahout-user mailing list archives

From nfantone <nfant...@gmail.com>
Subject Re: Clustering from DB
Date Fri, 10 Jul 2009 19:08:37 GMT
This error is still bugging me. The exception:

WARNING: java.io.FileNotFoundException: File
output/clusters-0/part-00000/* does not exist.
java.io.FileNotFoundException: File output/clusters-0/part-00000/*
does not exist.

occurs first at:

org.apache.mahout.clustering.kmeans.KMeansDriver.isConverged(KMeansDriver.java:298)

which corresponds to:

  private static boolean isConverged(String filePath, JobConf conf, FileSystem fs)
      throws IOException {
    Path outPart = new Path(filePath + "/*");
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, outPart, conf);  // <-- THIS
    ...
  }

where isConverged() is called in this fashion:

return isConverged(clustersOut + "/part-00000", conf, fs);

by runIteration(), which is previously invoked by runJob() like:

      String clustersOut = output + "/clusters-" + iteration;
      converged = runIteration(input, clustersIn, clustersOut, measureClass,
          delta, numReduceTasks, iteration);

Consequently, assuming it's the first iteration and the output folder
has been named "output" by the user, the SequenceFile.Reader receives
"output/clusters-0/part-00000/*" as a path, which does not exist. I
believe the path should end in "part-00000" and the + "/*" should be
removed... although someone, evidently, thought otherwise.
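
To make the concatenation explicit, here is a minimal sketch in plain
Java that reproduces the string building described above. It uses no
Hadoop classes (plain strings stand in for Path), and the class and
method names are made up for illustration; they are not part of Mahout.
The "proposed" variant is only my suggestion, not a confirmed fix.

```java
// Sketch only: plain strings stand in for Hadoop's Path; names are hypothetical.
class ConvergedPathSketch {

    // Mirrors runJob(): String clustersOut = output + "/clusters-" + iteration;
    static String clustersOut(String output, int iteration) {
        return output + "/clusters-" + iteration;
    }

    // Mirrors the behaviour described above: runIteration() appends
    // "/part-00000", then isConverged() appends "/*" to that file name.
    static String currentReaderPath(String output, int iteration) {
        String filePath = clustersOut(output, iteration) + "/part-00000";
        return filePath + "/*";
    }

    // Suggested change (an assumption, not the committed code): drop the "/*".
    static String proposedReaderPath(String output, int iteration) {
        return clustersOut(output, iteration) + "/part-00000";
    }

    public static void main(String[] args) {
        System.out.println(currentReaderPath("output", 0));   // output/clusters-0/part-00000/*
        System.out.println(proposedReaderPath("output", 0));  // output/clusters-0/part-00000
    }
}
```

The first line printed matches the path in the exception message; the
second is the part file I would expect the reader to open.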

Any feedback?

On Mon, Jul 6, 2009 at 5:39 PM, nfantone<nfantone@gmail.com> wrote:
> I was using Canopy to create input clusters, but the error appeared
> while running kMeans (if I run kMeans' job only with previously
> created clusters from Canopy placed in output/canopies as initial
> clusters, it still fails). I noticed no other problems. I was using
> revision 790979 before updating. Strangely, there were no changes in
> the job and driver classes since that revision. svn diff shows that the
> only classes that changed in the org.apache.mahout.clustering.kmeans
> package were KMeansInfo.java and RandomSeedGenerator.java.
>
> On Mon, Jul 6, 2009 at 3:55 PM, Jeff Eastman<jdog@windwardsolutions.com> wrote:
>> Hum, no, it's looking for the output of the first iteration. Were there
>> other errors? What was the last revision you were running? It does look like
>> something got horked, as it should be looking for output/clusters-0/*. Can
>> you diff the job and driver class to see what changed?
>>
>> Jeff
>>
>> nfantone wrote:
>>>
>>> Fellows, today I updated to revision 791558 and while running kMeans I
>>> got the following exception:
>>>
>>> WARNING: java.io.FileNotFoundException: File
>>> output/clusters-0/part-00000/* does not exist.
>>> java.io.FileNotFoundException: File output/clusters-0/part-00000/*
>>> does not exist.
>>>
>>> The algorithm isn't interrupted, though. But this exception wasn't
>>> thrown before the update and, to me, its message is not quite clear.
>>> It seems as if it's looking for any file inside a "part-00000" directory,
>>> which doesn't exist; and, as far as I know, "part-xxxxx" are default
>>> names for output files.
>>>
>>> I could show the entire stack trace, if needed. Any pointers?
>>>
>>>
>>> On Thu, Jul 2, 2009 at 3:16 PM, nfantone<nfantone@gmail.com> wrote:
>>>
>>>>
>>>> Thanks for the feedback, Jeff.
>>>>
>>>>
>>>>>
>>>>> The logical format of input to KMeans is <Key, Vector> as it is in
>>>>> sequence
>>>>> file format, but the Key is never used. To my knowledge, there is no
>>>>> requirement to assign identifiers to the input points*. Users are free
>>>>> to
>>>>> associate an arbitrary name field with each vector - also label mappings
>>>>> may
>>>>> be assigned - but these are not manipulated by KMeans or any of the
>>>>> other
>>>>> clustering applications. The name field is now used as a vector
>>>>> identifier
>>>>> by the KMeansClusterMapper - if it is non-null - in the output step
>>>>> only.
>>>>>
>>>>
>>>> The key may not be used internally, but externally they can prove to
>>>> be pretty useful. For me, keys are userIDs and each Vector represents
>>>> his/her historical behavior. Being able to collect the output
>>>> information as <UserID, ClusterID> is quite neat as it allows me to,
>>>> for instance, retrieve user information using data directly from a
>>>> HDFS file's field.
>>>>
>>>>
>>>
>>>
>>>
>>
>>
>
