mahout-dev mailing list archives

From Paritosh Ranjan <pran...@xebia.com>
Subject Re: Difference in results : Clustering : sequential and MapReduce
Date Mon, 03 Oct 2011 07:00:26 GMT
I am implementing functionality to route similar records to the same 
nodes (mappers). This will work like a preselection step, so we can keep 
using the clusterFilter check, which is a great performance enhancer, 
without any decrease in clustering quality.

I will try to provide a patch for this.
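
To illustrate the effect described in the quoted message below, here is a minimal, self-contained sketch. It is not Mahout code: the class CanopyFilterSketch, its methods, the 1-D points and the threshold values are all hypothetical, and the canopy step is simplified (fixed centers, T2 only). It only shows how the quoted check, canopy.getNumPoints() > clusterFilter, can discard every canopy when similar points are scattered across mappers, and keep them when similar points share a mapper.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified 1-D stand-in for the per-mapper canopy step; not Mahout code.
final class CanopyFilterSketch {

    // Greedy single-pass canopy assignment (assumption: T2 only, fixed centers).
    // A point joins the first canopy whose center is closer than t2,
    // otherwise it starts a new canopy. Returns the point count per canopy.
    static List<Integer> canopySizes(double[] points, double t2) {
        List<Double> centers = new ArrayList<>();
        List<Integer> sizes = new ArrayList<>();
        for (double p : points) {
            boolean placed = false;
            for (int i = 0; i < centers.size(); i++) {
                if (Math.abs(p - centers.get(i)) < t2) {
                    sizes.set(i, sizes.get(i) + 1);
                    placed = true;
                    break;
                }
            }
            if (!placed) {
                centers.add(p);
                sizes.add(1);
            }
        }
        return sizes;
    }

    // Each inner array models the input split of one mapper. Counts the canopies,
    // over all mappers, that pass the quoted check: numPoints > clusterFilter.
    static int survivingCanopies(double[][] splits, double t2, int clusterFilter) {
        int survivors = 0;
        for (double[] split : splits) {
            for (int size : canopySizes(split, t2)) {
                if (size > clusterFilter) {
                    survivors++;
                }
            }
        }
        return survivors;
    }

    public static void main(String[] args) {
        double t2 = 1.0;        // illustrative threshold
        int clusterFilter = 2;  // illustrative minimum canopy size

        // Sequential run: one process sees all eight points.
        double[][] sequential = {{1.0, 1.1, 1.2, 1.3, 9.0, 9.1, 9.2, 9.3}};
        // Four mappers, similar points scattered across them.
        double[][] scattered = {{1.0, 9.0}, {1.1, 9.1}, {1.2, 9.2}, {1.3, 9.3}};
        // Two mappers, similar points routed to the same mapper.
        double[][] colocated = {{1.0, 1.1, 1.2, 1.3}, {9.0, 9.1, 9.2, 9.3}};

        System.out.println(survivingCanopies(sequential, t2, clusterFilter));
        System.out.println(survivingCanopies(scattered, t2, clusterFilter));
        System.out.println(survivingCanopies(colocated, t2, clusterFilter));
    }
}
```

With these inputs, main prints 2, 0 and 2. The scattered layout leaves every mapper with canopies of a single point, so none passes the filter, while the co-located layout finds the same two canopies as the sequential run.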

On 03-10-2011 11:26, Paritosh Ranjan wrote:
> I found the reason for the difference.
> It is due to this check in CanopyMapper:
>
> if (canopy.getNumPoints() > clusterFilter)
>
>
> Similar data is not distributed evenly across the mappers, so a mapper 
> can produce canopies with too few points to pass the clusterFilter 
> check, and those canopies are not processed further.
> However, this check is a great performance enhancer; I have seen that 
> in practice.
>
> Routing similar vectors to the same mapper might help to attain 
> both quality and performance.
>
>
> On 03-10-2011 09:29, Paritosh Ranjan wrote:
>> The sequential algorithm finds more and better clusters than the 
>> MapReduce one.
>> The difference is not huge, but the standalone run is clearly better.
>>
>> Thanks and Regards,
>> Paritosh
>>
>> On 03-10-2011 01:47, Konstantin Shmakov wrote:
>>> I'd assume that the distributed and sequential algorithms shouldn't
>>> produce identical results. To start with, they differ in initial setup:
>>> -- In the distributed algorithm each mapper deals with a subset of the
>>> data and starts by picking a random point, so N random points are
>>> picked by N mappers to start with.
>>> -- In the sequential algorithm 1 mapper deals with all the data and
>>> starts by picking 1 random point.
>>> But for data with real clusters both algorithms should produce similar
>>> results.  How different are the results in your case?
>>>
>>> Thanks
>>> --Konstantin
>>>
>>> On Sun, Oct 2, 2011 at 1:36 AM, Paritosh Ranjan <pranjan@xebia.com> wrote:
>>>
>>>> Even the run() method of CanopyDriver that takes only T1 and T2 gives
>>>> different results for sequential and MapReduce execution.
>>>> This is preventing me from scaling up, as I need to run MapReduce on
>>>> Hadoop to scale.
>>>>
>>>> Does anyone have any idea about this problem?
>>>>
>>>> On 02-10-2011 00:27, Paritosh Ranjan wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I am able to cluster correctly sequentially, using CanopyDriver.
>>>>>
>>>>> However, the same dataset does not work when processed as a
>>>>> MapReduce job (with t1 = t3, t2 = t4 and t1 > t2). I am getting
>>>>> errors like "Canopies are empty".
>>>>>
>>>>> I also tried reducing the values of t3 and t4, but that either has
>>>>> no effect or gives meaningless results.
>>>>>
>>>>> Am I doing something wrong, or is there a bug somewhere?
>>>>>
>>>>> I feel that sequential and MapReduce execution should both give
>>>>> similar results, but that is not happening.
>>>>>
>>>>> Thanks and Regards,
>>>>> Paritosh
>>>>>
>>>>>
>>>>> -----
>>>>> No virus found in this message.
>>>>> Checked by AVG - www.avg.com
>>>>> Version: 10.0.1410 / Virus Database: 1520/3932 - Release Date: 
>>>>> 10/01/11
>>>>>
>>>>
>>>
>>
>

