mahout-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Random Selection Algorithm Problem in org.apache.mahout.clustering.kmeans.RandomSeedGenerator
Date Fri, 16 Dec 2011 05:31:38 GMT
Lijie,

You are correct.  This code is in error.  The mailing list lost your
coloring, but your point is still there.

I think that the code should be this instead.  Ironically, the comment in
the original code describes what the code does accurately.

int itemsSeenSoFar = 0;
for (Pair<Writable,VectorWritable> record
 : new SequenceFileIterable<Writable,VectorWritable>(fileStatus.getPath(),
true, conf)) {
    itemsSeenSoFar++;
    Writable key = record.getFirst();
    VectorWritable value = record.getSecond();
    Cluster newCluster = new Cluster(value.get(), nextClusterId++, measure);
    newCluster.observe(value.get(), 1);
    Text newText = new Text(key.toString());
    int currentSize = chosenTexts.size();
    if (currentSize < k) {
chosenTexts.add(newText);
chosenClusters.add(newCluster);
    } else {
int itemToReplace = random.nextInt(itemsSeenSoFar);
chosenTexts.set(itemToReplace, newText);
 chosenClusters.set(itemToReplace, newCluster);
    }
}

On Thu, Dec 15, 2011 at 9:10 PM, Lijie Xu <csxulijie@gmail.com> wrote:

> Hi, I'm now reading the source code of "org.apache.mahout.clustering.**
> kmeans.RandomSeedGenerator".
> There may be a problem in function "buildRandom" which aims to select the
> random k centroid vectors from streaming records.
>
> I'm wondering whether this algorithm is correct and I think the right
> algorithm is as follows:
>
> To select the k elements from streaming resource, we put the first k
> elements (i=1,2,...,k) into the buffer.
> When the /i/th (i > k) element comes, we want to keep it with the
> probability of k/i.
> If the /i/th element is selected, then we can randomly delete an element
> from the buffer and add the /i/th element into the buffer.
>
> So I think the code with red color is doubtable. If I'm wrong, please tell
> me. Thx!
>
> for (Pair<Writable,VectorWritable> record
>             : new SequenceFileIterable<Writable,**
> VectorWritable>(fileStatus.**getPath(), true, conf)) {
>          Writable key = record.getFirst();
>          VectorWritable value = record.getSecond();
>          Cluster newCluster = new Cluster(value.get(), nextClusterId++,
> measure);
>          newCluster.observe(value.get()**, 1);
>          Text newText = new Text(key.toString());
>          int currentSize = chosenTexts.size();
>          if (currentSize < k) {
>            chosenTexts.add(newText);
>            chosenClusters.add(newCluster)**;
>          } else if (random.nextInt(currentSize + 1) == 0) { // with chance
> 1/(currentSize+1) pick new element
>            int indexToRemove = random.nextInt(currentSize); // evict one
> chosen randomly
>            chosenTexts.remove(**indexToRemove);
>            chosenClusters.remove(**indexToRemove);
>            chosenTexts.add(newText);
>            chosenClusters.add(newCluster)**;
>          }
>        }
>
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message