mahout-user mailing list archives

From gaurav redkar <gauravred...@gmail.com>
Subject Re: Help regarding ClusterOutputPostProcessor
Date Tue, 31 Jan 2012 04:25:35 GMT
Hello. As Jeff suggested, I created a JIRA issue. Please take a look at
MAHOUT-966 <https://issues.apache.org/jira/browse/MAHOUT-966> and share
your input.

Thanks,
Gaurav

On Wed, Jan 25, 2012 at 8:51 PM, Jeff Eastman <jdog@windwardsolutions.com> wrote:

> Mean Shift accumulates the pointIds of every point assigned to a cluster,
> so I would expect n= to be correct in the cluster dumper output. It is most
> likely the postprocessor is misbehaving. Please create a JIRA and attach
> your dataset and we will take a look at it.
>
> It would also be useful for you to include the exact CLI commands which
> you used to duplicate this problem.
>
>
> On 1/25/12 2:41 AM, gaurav redkar wrote:
>
>> Hello,
>>
>> I was able to fix the aforementioned problem by implementing a
>> custom partitioner instead of using the default hash partitioner. I have
>> another issue though. After running the post processor, the number of
>> points that each cluster contains does not match the number of points
>> each cluster should contain according to clusterdumper.
>>
>>
>> MSV-287{ n=90 c=[0.05195, 0.05675, 0.07151, 0.05713, 0.06946,...}
>>
>> MSV-145{ n=90 c=[0.93685, 0.93071, 0.93641, 0.94629, 0.94409,..}
>>
>> The n reported in clusters-n-final for each cluster differs from
>> the number of points actually contained in the directory for each cluster.
>> Any idea why this is happening?
>>
>> PS: The dataset on which I tested the algorithm has 1000 records with 200
>> attributes per record. I can share the dataset I used if needed.
>>
>> Thanks,
>>
>> Gaurav
>>
>> On Fri, Jan 6, 2012 at 6:12 PM, Paritosh Ranjan <pranjan@xebia.com> wrote:
>>
>>> ClusterOutputPostProcessorDriver has options to run either sequentially
>>> or in a mapreduce way.
>>>
>>> If the clustering was done sequentially, then ClusterOutputPostProcessor
>>> should be run sequentially, and if the clustering was done in a mapreduce
>>> way, then run the ClusterOutputPostProcessor with option mapreduce=true.
>>>
>>> If you have already tried this and it's still not working, then filing a
>>> bug (as Lance mentioned) would be appropriate.
>>>
>>>
>>> On 06-01-2012 17:18, gaurav redkar wrote:
>>>
>>>> Hello,
>>>>
>>>> When I ran the ClusterOutputPostProcessor on synthetic_control_data in
>>>> mapreduce mode, I observed that one directory contained points
>>>> belonging to 2 other clusters, and the directories for those 2 clusters
>>>> were not created because their "part-*" files were empty and the function
>>>> "movePartFilesToRespectiveDirectories()" was not able to create the
>>>> directories to put them into. I have converted the sequence file
>>>> containing the points belonging to those 3 clusters into a text file (by
>>>> changing the output format to TextOutputFormat). Kindly find the attached
>>>> part-file, which can be viewed.
>>>>
>>>> Any suggestions as to why this might be happening?
>>>>
>>>> Note: The program runs fine in sequential mode.
>>>>
>>>> Thanks.
>>>>
>>>>
>
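
The symptom described earlier in the thread (one directory holding points from several clusters while other part files stay empty) is what one would expect if distinct cluster ids land in the same reduce partition. The following is only a standalone sketch of that effect, not the Mahout code and not the custom partitioner Gaurav wrote. It assumes the reduce key hashes to the integer cluster id (as Hadoop's IntWritable does) and that there is one partition per cluster. The ids 287 and 145 come from the cluster dump above; 290 is a hypothetical third id chosen to show a collision. The helper name denseIndex is made up for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class PartitionSketch {

    // Same arithmetic as Hadoop's default HashPartitioner,
    // applied directly to an integer hash value.
    static int hashPartition(int keyHash, int numPartitions) {
        return (keyHash & Integer.MAX_VALUE) % numPartitions;
    }

    // Hypothetical alternative: assign each distinct cluster id its own
    // dense index in [0, number of clusters), so no two ids share a partition.
    static Map<Integer, Integer> denseIndex(int[] clusterIds) {
        int[] sorted = clusterIds.clone();
        Arrays.sort(sorted);
        Map<Integer, Integer> index = new HashMap<>();
        for (int id : sorted) {
            index.putIfAbsent(id, index.size());
        }
        return index;
    }

    public static void main(String[] args) {
        // 287 and 145 appear in the thread; 290 is an assumed id.
        int[] clusterIds = {287, 145, 290};
        int numPartitions = clusterIds.length;
        Map<Integer, Integer> dense = denseIndex(clusterIds);
        for (int id : clusterIds) {
            System.out.println("cluster " + id
                    + ": hash partition " + hashPartition(id, numPartitions)
                    + ", dense partition " + dense.get(id));
        }
    }
}
```

With these three ids and three partitions, the hash scheme sends both 287 and 290 to partition 2 and leaves partition 0 empty, while the dense index gives every cluster its own partition. In a real job the id-to-index mapping would have to be made available to the partitioner, for example through the job configuration. How the actual fix did this is not stated in the thread.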
