hama-dev mailing list archives

From Shuo Wang <ecisp.wangs...@gmail.com>
Subject Re: PageRank Experiment Iteration
Date Thu, 25 Oct 2012 03:23:31 GMT
Hi,

I have changed the program that creates random input for SSSP so that it
generates a random graph for PageRank. My cluster has 10 nodes, each with 8 GB
of memory; it runs 45 tasks, each with 512 MB of memory, and I set the groom
memory to 2000 MB. The largest input I can run now is 133 MB; anything larger
fails with an "OUTOFMEMORY" error.

What's more, I find it also depends on the number of vertices and edges. For
example, 500,000 vertices with 32 edges each runs and gives the right result;
1,000,000 vertices with 4 edges each fails, or the result is NULL or infinity.
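
(A hedged note on those two runs, as an editorial inference not stated in the
thread: 500,000 vertices x 32 edges is 16 million edges and succeeds, while
1,000,000 vertices x 4 edges is only 4 million edges and fails, which suggests
that per-vertex overhead, rather than raw edge count, dominates the memory
footprint.)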

2012/10/24 Thomas Jungblut <thomas.jungblut@gmail.com>

> 512 MB is rough. Normally a datanode consumes 1 GB of memory, so if you
> start a groom on the same machine, there is not much room left for it. I
> don't think it will run very well on these machines (if at all).
> Pretty much nothing is disk-based here, so you need memory to scale out
> (unlike in MapReduce). We do want to enable disk-based processing, but it
> will take more time to get there.
>
> I have written a rough sketch of what should be done to make Hama more
> scalable:
>
> https://docs.google.com/document/d/1Fud5zSFuKDAEz3E8T59ldZtg1H-IMx2CQGbn_bib_eA/edit
>
> But this is future work, and maybe not all of it will scale well to many
> more machines.
>
> 2012/10/24 Shuo Wang <ecisp.wangshuo@gmail.com>
>
> > I have tried it on our cluster as you say, but the result is wrong: there
> > are no scores for the nodes. It is the same error as I had before.
> >
> > 2012/10/24 Shuo Wang <ecisp.wangshuo@gmail.com>
> >
> > > Thank you, let me try!
> > >
> > >
> > > 2012/10/24 Thomas Jungblut <thomas.jungblut@gmail.com>
> > >
> > >> Yes, I generated it for an algorithm on movie actors (to calculate
> > >> Kevin Bacon numbers).
> > >> However, as I already told you, you can rewrite the generator MapReduce
> > >> job that creates random input for SSSP:
> > >>
> > >> https://github.com/thomasjungblut/thomasjungblut-common/blob/master/src/de/jungblut/bsp/RandomGraphGenerator.java
> > >>
> > >> Basically you have to remove the weights from the output of
> > >> RandomMapper. So instead of
> > >>
> > >>     s += Long.toString(rowId) + ":" + rand.nextInt(100) + "\t";
> > >>
> > >> you would do:
> > >>
> > >>     s += Long.toString(rowId) + "\t";
> > >>
> > >> Of course you can also use a StringBuilder instead of +=, but String
> > >> concatenation usually isn't a bottleneck in MapReduce ;)
> > >>
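
(For reference, a minimal sketch of the resulting weight-free line builder.
The class, method, and parameter names below are illustrative assumptions
modeled on the linked generator, not its actual code:)

    import java.util.Random;

    // Editorial sketch (not from the thread): build one unweighted
    // adjacency-list line per vertex, as the modified RandomMapper would.
    public final class UnweightedLineSketch {
      public static String vertexLine(long vertexId, int outEdges,
                                      long numVertices, Random rand) {
        StringBuilder s = new StringBuilder();
        s.append(vertexId); // the vertex id opens the line
        for (int i = 0; i < outEdges; i++) {
          long neighbour = (long) (rand.nextDouble() * numVertices);
          s.append('\t').append(neighbour); // neighbour id only, no ":weight"
        }
        return s.toString(); // tab-separated, one vertex per line
      }
    }
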
> > >> 2012/10/24 Shuo Wang <ecisp.wangshuo@gmail.com>
> > >>
> > >> > Do you generate the data yourself? Can you provide the data
> > >> > generator for me?
> > >> >
> > >> > 2012/10/24 Thomas Jungblut <thomas.jungblut@gmail.com>
> > >> >
> > >> > > 12 gigs. It uses several times (up to 10x?) more memory than the
> > >> > > dataset size.
> > >> > >
> > >> > > 2012/10/24 Shuo Wang <ecisp.wangshuo@gmail.com>
> > >> > >
> > >> > > > How large is your data? Our cluster has 10 nodes and 45 tasks;
> > >> > > > each task has 512 MB of memory. But when I run on 200 MB of data,
> > >> > > > it fails with an OUTOFMEMORY error.
> > >> > > >
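
(A hedged back-of-envelope reading of the two messages above, as an editorial
assumption not from the thread: at the "up to 10x" figure, a 200 MB input
implies roughly 2 GB of live heap in total, well within the cluster's
45 x 512 MB = ~22.5 GB of aggregate task memory; but each partition's vertices
and message buffers must fit into a single 512 MB task heap, so a skewed
partition can still hit OUTOFMEMORY.)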
> > >> > > > 2012/10/24 Thomas Jungblut <thomas.jungblut@gmail.com>
> > >> > > >
> > >> > > > > Sure it does run, if you have enough RAM ;)
> > >> > > > >
> > >> > > > > 2012/10/24 Shuo Wang <ecisp.wangshuo@gmail.com>
> > >> > > > >
> > >> > > > > > How much data have you run PageRank on in HAMA? Does it run?
> > >> > > > > > I want to run PageRank on large data in HAMA, but it always
> > >> > > > > > fails.
> > >> > > > > >
> > >> > > > > > 2012/10/24 Thomas Jungblut <thomas.jungblut@gmail.com>
> > >> > > > > >
> > >> > > > > > > Yes, it works on any directed graph.
> > >> > > > > > > The best format to use is
> > >> > > > > > >
> > >> > > > > > > Vertex <\t> AdjacentVertex1 <\t> AdjacentVertex2 ... <\n>
> > >> > > > > > >
> > >> > > > > > > So you have an adjacency list, and each line represents a
> > >> > > > > > > vertex. This is splittable, which the web-Google dataset
> > >> > > > > > > is not.
> > >> > > > > > >
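
(As an illustration, with hypothetical data not from the thread, a four-vertex
graph in this tab-separated format would look like:)

    0	1	3
    1	2
    2	0	1
    3	2

(Here vertex 0 has out-edges to vertices 1 and 3, and so on; fields are
separated by tabs, and each vertex's adjacency list is one line, which is what
makes the file splittable.)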
> > >> > > > > > > 2012/10/24 Shuo Wang <ecisp.wangshuo@gmail.com>
> > >> > > > > > >
> > >> > > > > > > > Thanks! Does PageRank work on any web graph? I generated
> > >> > > > > > > > a random web graph in the same format as web-Google.txt,
> > >> > > > > > > > but the result is infinity.
> > >> > > > > > > >
> > >> > > > > > > > 2012/10/24 Thomas Jungblut <thomas.jungblut@gmail.com>
> > >> > > > > > > >
> > >> > > > > > > > > Because graph iterations != supersteps. You have to
> > >> > > > > > > > > take the partitioning into account, and the time to
> > >> > > > > > > > > accumulate the number of vertices. PageRank also
> > >> > > > > > > > > requires an additional superstep to run the aggregators.
> > >> > > > > > > > >
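
(One hedged way to account for the numbers in the question below, as an
editorial assumption not confirmed in the thread: if each PageRank iteration
costs two supersteps, one to compute and send ranks and one for aggregator
synchronization, 20 iterations take about 2 x 20 = 40 supersteps, and the
remaining 8 would be setup work such as partitioning the input and counting
the vertices.)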
> > >> > > > > > > > > 2012/10/24 Shuo Wang <ecisp.wangshuo@gmail.com>
> > >> > > > > > > > >
> > >> > > > > > > > > > Hi,
> > >> > > > > > > > > >
> > >> > > > > > > > > > I ran PageRank on HAMA and set the max iteration
> > >> > > > > > > > > > count to 20, but it ran 48 supersteps. Why?
> > >> > > > > > > > > >
