mahout-user mailing list archives

From Robin Anil <robin.a...@gmail.com>
Subject Re: PFPGrowth on cluster does not distribute work load equally on nodes
Date Wed, 23 Jun 2010 13:01:45 GMT
The PFPGrowth job does not set the number of map or reduce tasks itself. The
reason you see only one map is probably that the data is small: the transaction
sorting job converts the data into integers, and you may see a drastic
reduction in dataset size (check the compressed data size under
output/sortedoutput). If you need more maps for that small dataset, you can
decrease the split size, but it's not really worth the effort, as the map job
only makes a single pass over the data. The PFPGrowth reducer, on the other
hand, is where the actual data-mining computation takes place. If you find that
that part is not scaling, tell me. Otherwise you may have to spend some time
tuning your cluster using the scores of knobs Hadoop provides.
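
For reference, a minimal sketch of lowering the split size (untested; the
property name is the key read by the new-API FileInputFormat in Hadoop 0.20,
and the 16 MB figure is only illustrative):

    import org.apache.hadoop.conf.Configuration;

    public class SmallerSplits {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // The new-API FileInputFormat computes
        //   splitSize = max(minSplitSize, min(maxSplitSize, blockSize)),
        // so capping the maximum split size below the HDFS block size
        // yields more splits, and therefore more map tasks.
        conf.setLong("mapred.max.split.size", 16L * 1024 * 1024); // ~16 MB
        // ...then pass this conf to the Job that reads the input.
      }
    }

Note that mapred.map.tasks is only a hint: with a single small input file the
framework still creates just one split, which matches what you are seeing.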

Robin

On Wed, Jun 23, 2010 at 6:17 PM, <jacobz@gmx.de> wrote:

> Hello Robin.
> Thank you for your answer.
>
> I still have trouble getting it to work the way I want.
> I have about 400 unique features in a big dataset (a preprocessed 30%
> sample of the US Census 1990 dataset).
>
> I tried setting the number of groups to 4, 40, and 400. The map capacity of
> my cluster is 20 (10 nodes, 2 maps per node), but the "PFPGrowth" job always
> uses only a single map task. The "Parallel Counting" and "PFP Transaction
> Sorting" jobs each use 6 map tasks.
>
> In the job configuration of the "PFPGrowth" job, the value of
> "mapred.map.tasks" is 1, although I have set it to 20.
>
> The number of map tasks corresponds to the number of input-file splits,
> right? So can I somehow force the input file to be split, or would that
> alter the result somehow?
>
> Or do you have another idea?
>
> Thanks a lot in advance; I have been trying for hours to get this working.
>
> Björn
>
> > Hi Bjorn, the distribution of the data is skewed. That's a problem with
> > the algorithm as proposed in the paper. The way around it is to increase
> > the number-of-groups parameter. For example, if you have 10K unique
> > features, try to split them into groups such that there are around 10
> > features per group. Each reducer finds the top-k patterns by creating
> > FP-Trees containing predominantly those 10 features. So set the number of
> > groups to 1000.
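> >
> > As a rough sketch of driving this from code (untested; the class and
> > parameter-key names follow the Mahout trunk of this period, and the
> > paths are placeholders, so check them against your checkout):
> >
> >     import org.apache.mahout.common.Parameters;
> >     import org.apache.mahout.fpm.pfpgrowth.PFPGrowth;
> >
> >     public class RunPFP {
> >       public static void main(String[] args) throws Exception {
> >         Parameters params = new Parameters();
> >         params.set("input", "input/transactions");  // placeholder path
> >         params.set("output", "output/patterns");    // placeholder path
> >         params.set("minSupport", "2");
> >         params.set("maxHeapSize", "50");  // top-k patterns kept per feature
> >         params.set("numGroups", "1000");  // ~10 of 10K features per group
> >         // Drives the whole pipeline: parallel counting, transaction
> >         // sorting, parallel FP-Growth, and aggregation.
> >         PFPGrowth.runPFPGrowth(params);
> >       }
> >     }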
> >
> > Robin
> >
> > 2010/6/16 "Björn Jacobs" <jacobz@gmx.de>
> >
> > > Hello everyone!
> > >
> > > I am trying to get familiar with PFPGrowth in the Mahout packages. I am
> > > planning to adapt this code to run a parallelized subgroup discovery;
> > > this is, by the way, the aim of the bachelor thesis I am currently
> > > writing.
> > >
> > > I'm having the problem that the algorithm does not distribute the work
> > > load equally across the nodes in my cluster. I have 10 nodes, and I set
> > > mapred.map.tasks=15 as well as the mapred.reduce.tasks variable.
> > >
> > > My problem is that the "PFP Growth Driver running over
> > > input/test002/sortedoutput" job did the following:
> > >
> > > Node 0 got nearly 100% of the work (finished in 20 minutes)
> > > Node 1-3 got a very small piece (finished in less than 10 seconds)
> > > Node 4-14 got nothing and finished execution immediately
> > >
> > > This way one node had to do all the work while the others had nothing
> > > to do, and the job took really long to finish... that's not parallel.
> > >
> > > Is this a bug or do I have to configure something to get this working?
> > > Thanks a lot!
> > >
> > > Yours,
> > > Björn Jacobs
>
>
