mahout-user mailing list archives

From Jeff Eastman <>
Subject Re: Help regarding ClusterOutputPostProcessor
Date Wed, 25 Jan 2012 15:21:41 GMT
Mean Shift accumulates the pointIds of every point assigned to a 
cluster, so I would expect n= to be correct in the cluster dumper 
output. Most likely the postprocessor is misbehaving. Please 
create a JIRA and attach your dataset and we will take a look at it.

It would also help if you included the exact CLI commands you 
used to reproduce this problem.

On 1/25/12 2:41 AM, gaurav redkar wrote:
> Hello,
> I was able to rectify the aforementioned problem after I implemented a
> custom partitioner instead of using the default hash partitioner. I have
> another issue though. After running the post processor, the number of points
> that each cluster contains does not match the number of points each
> cluster should contain according to clusterdumper.
> MSV-287{ n=90 c=[0.05195, 0.05675, 0.07151, 0.05713, 0.06946,...}
> MSV-145{ n=90 c=[0.93685, 0.93071, 0.93641, 0.94629, 0.94409,..}
> The n reported in clusters-n-final for each cluster differs from
> the number of points actually contained in the directory for each cluster.
> Any idea why this is happening?
> PS: The dataset on which I tested the algorithm has 1000 records with 200
> attributes per record. I can share the dataset that I used if needed.
> Thanks,
> Gaurav
> On Fri, Jan 6, 2012 at 6:12 PM, Paritosh Ranjan<>  wrote:
>> ClusterOutputPostProcessorDriver has options to run either sequentially or
>> in mapreduce mode.
>> If the clustering was done sequentially, then ClusterOutputPostProcessor
>> should be run sequentially, and if the clustering was done in mapreduce
>> mode, then run the ClusterOutputPostProcessor with option mapreduce=true.
>> If you have already tried this and it's still not working, then filing a
>> bug (as Lance mentioned) would be appropriate.
>> On 06-01-2012 17:18, gaurav redkar wrote:
>>> Hello,
>>> When I ran the ClusterOutputPostProcessor on synthetic_control_data in
>>> mapreduce mode, I observed that one directory contained points belonging to
>>> 2 other clusters, and the directories for those 2 clusters were not
>>> created because their "part-*" files were empty and the function
>>> "movePartFilesToRespectiveDirectories()" was not able to create the
>>> directories to put them into. I have converted the sequence file containing
>>> the points belonging to those 3 clusters into a text file (by changing the
>>> output format to TextOutputFormat). Kindly find the attached part file,
>>> which can be viewed.
>>> Any suggestions as to why this might be happening?
>>> Note: The program runs fine in sequential mode.
>>> Thanks.
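
The symptom in the oldest message (points from several clusters landing in
one part file while the other part files stay empty) is what one would see
if cluster keys collide under a hash partitioner, which may be why the
custom partitioner mentioned later helped. Below is a minimal sketch,
assuming integer cluster IDs and three reducers. The IDs 4, 7 and 10 and the
dedicatedPartitions helper are hypothetical and are not taken from the thread
or from Mahout. The hashPartition formula mirrors the one Hadoop's default
HashPartitioner applies to a key's hashCode().

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class PartitionSketch {

    // Same arithmetic as Hadoop's default HashPartitioner:
    // (hashCode & Integer.MAX_VALUE) % numPartitions.
    // Assumption: the cluster key hashes to its integer ID.
    static int hashPartition(int clusterId, int numPartitions) {
        return (Integer.hashCode(clusterId) & Integer.MAX_VALUE) % numPartitions;
    }

    // Hypothetical alternative: give every known cluster ID its own
    // partition index, assigned in the order the IDs are listed.
    static Map<Integer, Integer> dedicatedPartitions(List<Integer> clusterIds) {
        Map<Integer, Integer> index = new TreeMap<>();
        for (int id : clusterIds) {
            index.putIfAbsent(id, index.size());
        }
        return index;
    }

    public static void main(String[] args) {
        List<Integer> clusterIds = List.of(4, 7, 10); // illustrative IDs only
        int numPartitions = clusterIds.size();

        Map<Integer, Integer> dedicated = dedicatedPartitions(clusterIds);
        for (int id : clusterIds) {
            System.out.println("cluster " + id
                + " hash partition=" + hashPartition(id, numPartitions)
                + " dedicated partition=" + dedicated.get(id));
        }
    }
}
```

With these illustrative IDs all three clusters hash to the same partition, so
two of the three part files would be empty. A one-partition-per-cluster
mapping avoids that, provided the number of reducers equals the number of
clusters.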
