mahout-user mailing list archives

From praneet mhatre <praneetmha...@gmail.com>
Subject Re: Help regarding ClusterOutputPostProcessor
Date Thu, 26 Apr 2012 22:10:17 GMT
Hi,

I had a look at the JIRA and it looks like the issue is still unresolved. I
wanted to know whether the suggestion that the postprocessor may be at fault
has been verified.

I am using Dirichlet clustering for a project of mine and I also noticed
the mismatch between the number of points actually present in the cluster
and the value of n. I was wondering if the clusteredPoints directory
contains the correct point assignment and if I could just use that for the
purpose of my project.
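
One way to check this, as a minimal sketch: assuming the assignments in
clusteredPoints have already been exported as (cluster id, point) pairs and
the n values have been copied from the cluster dumper output, the two counts
can be compared directly. The function names and the sample data below are
hypothetical and are not part of Mahout.

```python
from collections import Counter


def count_points_per_cluster(assignments):
    """Count how many points were assigned to each cluster id."""
    return Counter(cluster_id for cluster_id, _point in assignments)


def find_mismatches(reported_n, assignments):
    """Return {cluster_id: (reported, counted)} where the two counts differ.

    reported_n  -- dict of cluster id -> n as printed by the cluster dumper
    assignments -- iterable of (cluster id, point) pairs (assumed export format)
    """
    counted = count_points_per_cluster(assignments)
    mismatches = {}
    for cluster_id in sorted(set(reported_n) | set(counted)):
        reported = reported_n.get(cluster_id, 0)
        actual = counted.get(cluster_id, 0)
        if reported != actual:
            mismatches[cluster_id] = (reported, actual)
    return mismatches


if __name__ == "__main__":
    # Hypothetical data, for illustration only.
    reported = {"C-0": 3, "C-1": 2}
    points = [
        ("C-0", [0.1, 0.2]),
        ("C-0", [0.2, 0.1]),
        ("C-1", [0.9, 0.8]),
        ("C-1", [0.8, 0.9]),
    ]
    print(find_mismatches(reported, points))
```

An empty result would mean the assignments agree with n for every cluster; any
entry shows the reported value first and the counted value second.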

Thanks!

On Mon, Jan 30, 2012 at 8:25 PM, gaurav redkar <gauravredkar@gmail.com> wrote:

> Hello. As Jeff mentioned, I created a JIRA issue. Kindly check out
> MAHOUT-966 <https://issues.apache.org/jira/browse/MAHOUT-966> and share
> your inputs.
>
> Thanks,
> Gaurav
>
> On Wed, Jan 25, 2012 at 8:51 PM, Jeff Eastman <jdog@windwardsolutions.com> wrote:
>
> > Mean Shift accumulates the pointIds of every point assigned to a cluster,
> > so I would expect n= to be correct in the cluster dumper output. It is most
> > likely the postprocessor is misbehaving. Please create a JIRA and attach
> > your dataset and we will take a look at it.
> >
> > It would also be useful for you to include the exact CLI commands which
> > you used to duplicate this problem.
> >
> >
> > On 1/25/12 2:41 AM, gaurav redkar wrote:
> >
> >> Hello,
> >>
> >> I was able to rectify the aforementioned problem after I implemented a
> >> custom partitioner instead of using the default hash partitioner. I have
> >> another issue though. After running the post processor, the number of
> >> points that each cluster contains does not match the number of points
> >> each cluster should contain as stated by clusterdumper.
> >>
> >>
> >> MSV-287{ n=90 c=[0.05195, 0.05675, 0.07151, 0.05713, 0.06946,...}
> >>
> >> MSV-145{ n=90 c=[0.93685, 0.93071, 0.93641, 0.94629, 0.94409,..}
> >>
> >> The n reported in clusters-n-final for each cluster differs from the
> >> number of points actually contained in the directory for that cluster.
> >> Any idea why this is happening?
> >>
> >> PS: The dataset on which I tested the algorithm has 1000 records with
> >> 200 attributes per record. I can share the dataset that I have used if
> >> needed.
> >>
> >> Thanks,
> >>
> >> Gaurav
> >>
> >> On Fri, Jan 6, 2012 at 6:12 PM, Paritosh Ranjan <pranjan@xebia.com> wrote:
> >>
> >>> ClusterOutputPostProcessorDriver has options to run either sequentially
> >>> or in a mapreduce way.
> >>>
> >>> If the clustering was done sequentially, then ClusterOutputPostProcessor
> >>> should be run sequentially, and if the clustering was done in a mapreduce
> >>> way, then run the ClusterOutputPostProcessor with option mapreduce=true.
> >>>
> >>> If you have already tried this, and it is still not working, then filing
> >>> a bug (as Lance mentioned) would be appropriate.
> >>>
> >>>
> >>> On 06-01-2012 17:18, gaurav redkar wrote:
> >>>
> >>>> Hello,
> >>>> When I ran the ClusterOutputPostProcessor on synthetic_control_data in
> >>>> mapreduce mode, I observed that one directory contained points
> >>>> belonging to 2 other clusters, and the directories relating to those 2
> >>>> clusters were not created, as their "part-*" files were empty and the
> >>>> function "movePartFilesToRespectiveDirectories()" was not able to create
> >>>> the directories to put them into. I have converted the sequence file
> >>>> containing the points belonging to those 3 clusters into a text file (by
> >>>> changing the output format to TextOutputFormat). Kindly find the
> >>>> attached part-file, which can be viewed.
> >>>> Any suggestions as to why this might be happening?
> >>>> Note: The program runs fine in sequential mode.
> >>>> Thanks.
> >>>>
> >>>>
> >
>



-- 
Praneet Mhatre
Graduate Student
Donald Bren School of ICS
University of California, Irvine
