Mahout may need to extract the random sampling as an independent class
since it's a very often used algorithm.
On Sat, Jan 12, 2013 at 8:30 PM, Sean Owen <srowen@gmail.com> wrote:
> I'm sure someone fixed something like this a while ago but yes I still
> see this in the code. Search JIRA and file a bug
>
> On Sun, Jan 13, 2013 at 2:20 AM, sam wu <swu5530@gmail.com> wrote:
>> Sorry for my previous email, apparently not yet finished but get sent out.
>>
>> Here is the complete one
>>
>> Similar to MIA ch09 RandomPointUtil.java, random selection is not uniform
>> random.
>>
>> org.apache.mahout.clustering.kmeans.RandonSeedGenerator.java problem
>> 
>> line 96109
>>
>> if (currentSize < k) {
>>
>> //select
>>
>> } else if (random.nextInt(currentSize + 1) != 0) { // with chance
>> 1/(currentSize+1) pick new element
>>
>> int indexToRemove = random.nextInt(currentSize); // evict one
>> chosen randomly
>>
>> // replace with new
>>
>> }
>>
>> 
>>
>> again this is not uniform random.
>>
>> later sample will get much higher probability to be selected than beginning
>> sample.
>>
>> because currentSize stay to be k after initial k samples. and new sample
>> will be picked with 1/(k+1) probability.
>>
>> So, all ending samples will be selected with much higher prob.
>>
>> In case of 1000 samples, k=3 , most likely selected 3 samples will be > 980
>>
>>
>> Sam
>>
>> On Sat, Jan 12, 2013 at 6:07 PM, sam wu <swu5530@gmail.com> wrote:
>>
>>> Similar to MIA ch09 RandomPointUtil.java, random selection is not uniform
>>> random.
>>>
>>> org.apache.mahout.clustering.kmeans.RandonSeedGenerator.java problem
>>>
>>> line 96109
>>>
>>> if (currentSize < k) {
>>>
>>> //select
>>>
>>> } else if (random.nextInt(currentSize + 1) != 0) { // with
>>> chance 1/(currentSize+1) pick new element
>>>
>>> int indexToRemove = random.nextInt(currentSize); // evict one
>>> chosen randomly
>>>
>>> // replace with new
>>>
>>> }
>>>

Shengchao Ding
