giraph-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Eli Reisman (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (GIRAPH-273) Aggregators shouldn't use Zookeeper
Date Fri, 31 Aug 2012 16:30:07 GMT

    [ https://issues.apache.org/jira/browse/GIRAPH-273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13446099#comment-13446099
] 

Eli Reisman commented on GIRAPH-273:
------------------------------------

That makes a lot of sense, maybe thats the right way to go then. What I was thinking (just
for reference) is along these lines:

We aggregate all values in aggregators at each worker during compute() cycle, so really we
have total messages per aggregation at each superstep come to (# of workers) * (# of aggregators
in that application.)

We set a single ZK node at the end of each superstep that the master creates once all workers
have put up nodes to say they are done with that superstep. When this new node appears, workers
start sending their aggregated values from that superstep. They have their own Worker ID number
already, and they can get "-w" (the total workers in the application run) from Configuration.
So then they have a sort of heap swim() function that takes these two values and gives back
their parent node in the tree. Since all other network activity has ceased for a moment, passing
messages along should not be too expensive compared to the volumes we send during the work
phases of a superstep already. If they get aggregator messages, they pass them to parent.
It gets a bit busy at the top of the tree, but even then our typical messaging should be much
more volume so it ought to be ok if we got this far without crashing already?

So...at the top worker 1, 2 report to the top of the heap, which is the master (worker 0)
and that is where all the final aggregating takes place, since the master has nothing to do.
Alternately, the top couple nodes in the tree (as determined by their height in the tree)
might do some sub aggregating to cut down on message volume. This could be set up to whatever
tests the best (probably some sub aggregating)

Finally, when the master gets (# of workers) * (# of aggregators) values (or with sub-agg,
2 messages * # of aggregators) then it writes to that znode a child that says "time to move
on with the superstep" and we go forward. If we pass a timeout without hearing from everyone,
retry or app fail etc.

This means no new connections except a single one to master from nodes 1 & 2 which is
nice. We would love to scale up further into the 4 figures and the # of connections maintained
per worker is starting to become a bit of a problem. I definitely agree doing the work at
the master when possible for aggregators (as with the 1 connect from each worker method) is
good because the master is not busy in our current scheme.

Anyway, great work on the other sections so far, thats a lot of code to write! Looking forward
to this!

                
> Aggregators shouldn't use Zookeeper
> -----------------------------------
>
>                 Key: GIRAPH-273
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-273
>             Project: Giraph
>          Issue Type: Improvement
>            Reporter: Maja Kabiljo
>            Assignee: Maja Kabiljo
>
> We use Zookeeper znodes to transfer aggregated values from workers to master and back.
Zookeeper is supposed to be used for coordination, and it also has a memory limit which prevents
users from having aggregators with large value objects. These are the reasons why we should
implement aggregators gathering and distribution in a different way.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Mime
View raw message