Subject: Re: Multiple jobs on same graph, aggregator use and LocalRunner issue
From: Benjamin Heitmann
Date: Wed, 6 Jun 2012 03:10:03 +0100
To: user@giraph.apache.org
Message-Id: <763BF0F6-2E00-47D3-9F80-14BC3EDF3801@deri.org>
In-Reply-To: <1338931310.2792.20.camel@clivelt2>

Hi Clive,

On 5 Jun 2012, at 22:21, Clive Cox wrote:
>
> I recently started playing with Giraph and I have a few questions.
>
> 1. I'm writing a simple spreading activation algorithm

I am also working on a spreading activation algorithm.
My original data is in the form of an RDF graph, which has typed edges and vertices,
and is therefore quite far from the PageRank-style algorithms for which
Google Pregel, and thus Apache Giraph, is optimised.
So I understand your questions very well.

> which would be
> run many times over the same graph with different initial vertices
> activated. Doing this as separate jobs in which a potentially large
> graph is loaded each time will be slow. Is there a way to run multiple
> BSP runs over the same loaded graph?

Sadly, this is not currently possible, as far as I know. The Hadoop
paradigm is focused on jobs with a transient graph.
But if enough people speak up to point out how inefficient it is to just
throw away the graph between jobs, maybe some sort of mechanism can be
added for running the same algorithm with different "configurations" on
the same graph.

I need to run the same algorithm on the same graph for different user
profiles ("different configurations"), and it was a big challenge to run
all of those configurations in parallel in just one job. In my case,
building the graph takes between a quarter and a third of the total
processing time.

> 2. I might want to normalise the vertex values at the end of a
> superstep. I assume I can use an aggregator to get the sum of the values
> but I'm not sure where can I update all vertex values before the next
> superstep?
The best place right now to add coordinating logic based on knowledge of
the whole graph is the WorkerContext, specifically its pre-superstep method.

In the compute method of a vertex, you can add a value to a Sum/LongSum
aggregator. Then, in the pre-superstep method of the WorkerContext, you can
check the value of that aggregator. You can then either re-set that same
aggregator, or set another aggregator. In the next superstep, the vertices
will need to check that aggregator and retrieve the new normalised value.

Somebody started to work on a patch for a centralised master which will be
able to control/coordinate the whole graph, but nothing has been finished
for that. The Jira issue is here:
https://issues.apache.org/jira/browse/GIRAPH-127

> 3. On a smaller trivial point: Running within a LocalRunner for
> debugging I need to delete the local zookeeper state created in _bsp*
> folders otherwise the next run does nothing as its assumes its the same
> state and just finishes straight away.

I never had that issue, so I can't comment on that.
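To make the aggregator pattern for question 2 concrete, here is a
stand-alone Java sketch. It does not use the real Giraph API; the class
and method names are my own invention, and the loops only emulate what
the compute method and WorkerContext would do across two supersteps.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of aggregator-based normalisation:
// superstep N: every vertex adds its value to a "sum" aggregator;
// pre-superstep of N+1: the WorkerContext reads the aggregated sum;
// superstep N+1: every vertex divides its value by that sum.
public class NormalizeSketch {

    public static Map<String, Double> normalize(Map<String, Double> vertexValues) {
        // Superstep N: each vertex "aggregates" its value into the sum.
        double sum = 0.0;
        for (double v : vertexValues.values()) {
            sum += v;
        }
        // Pre-superstep of N+1: the worker checks the aggregated value
        // (and could re-set the aggregator here, e.g. to 1/sum).
        double norm = (sum == 0.0) ? 1.0 : sum;
        // Superstep N+1: each vertex retrieves the aggregated value
        // and updates itself with the normalised result.
        Map<String, Double> out = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : vertexValues.entrySet()) {
            out.put(e.getKey(), e.getValue() / norm);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Double> graph = new LinkedHashMap<>();
        graph.put("a", 2.0);
        graph.put("b", 6.0);
        // After normalisation the values sum to 1.0.
        System.out.println(normalize(graph));
    }
}
```

In real Giraph code, the summing would happen via the registered
Sum/LongSum aggregator in each vertex's compute method, and the division
would happen in compute during the following superstep.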
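Regarding question 3: if deleting those folders by hand becomes tedious, a
small helper run before each debug session could do it. This is only a
guess at a workaround (as said, I have not hit the issue myself), and the
`_bsp*` directory pattern is taken from your description:

```java
import java.io.File;

// Hypothetical helper: delete stale local _bsp* state before a
// LocalRunner debug run. Adjust the base directory for your setup.
public class CleanLocalBspState {

    // Removes every _bsp* directory directly under baseDir;
    // returns how many were deleted.
    public static int clean(File baseDir) {
        File[] dirs = baseDir.listFiles((dir, name) -> name.startsWith("_bsp"));
        if (dirs == null) {
            return 0;
        }
        int removed = 0;
        for (File dir : dirs) {
            if (deleteRecursively(dir)) {
                removed++;
            }
        }
        return removed;
    }

    private static boolean deleteRecursively(File f) {
        File[] children = f.listFiles();
        if (children != null) {
            for (File child : children) {
                deleteRecursively(child);
            }
        }
        return f.delete();
    }

    public static void main(String[] args) {
        System.out.println("removed " + clean(new File(".")) + " _bsp* directories");
    }
}
```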