hama-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Suraj Menon <menonsur...@gmail.com>
Subject Re: [DISCUSS]Helix for global synchronization in Hama
Date Fri, 10 May 2013 03:48:58 GMT
Hi Kishore,

Nice to see your post here. The code refactor for Synchronization service
was done to both regulate and differentiate what could be done on Zookeeper
or any other synchronization service by a peer (Container each running one
task of the job) and at BSPMaster for the job. It is still more biased
towards Zookeeper API's though :) . This differentiation is done between
BSPMasterSyncClient and BSPPeerSyncClient. The purpose and their use is
same for YARN or non-YARN mode.

Fault tolerance is based on checkpointing the messages sent among peers in
a job. At end of every n (configurable) supersteps, all the messages
received by a peer is checkpointed and saved into HDFS. The peers keep an
account in Zookeeper about the last superstep for which they have
checkpointed the messages. On failure, BSPMaster finds the smallest
superstep number for which there are checkpoints saved for all peers, and
restarts the tasks from that superstep along with the new recovery task.
Obviously this is not industrial strength solution. We have to save the
state of each peer that is essential for computation. Also, since each
message is written to HDFS, the performance goes for a toss. With the new
spilling queue, the messages do get persisted to a file and we would be
looking into improving the performance from there. The current code is in
o.a.hama.ft package.

After we met, I did go through a bit of Helix code. Maybe a stupid
question, but I have to ask :) ; the different recipes suggests a model of
handling state transitions in Helix. But when we are talking about a fault
tolerance within a job, does it mean we should have a separate instance of
Helix Controller for a job. So for 2 jobs submitted, would we have two
Helix controllers that would implement a custom state transition model
defined for fault tolerance? I need some more time to understand Helix.

I don't see any cons for Hama to integrate with Helix, apart from
the dependency on a new system. We can create a new helix-hama module and
have pluggable implementation in Helix in the module to begin with.

Regards,
Suraj



On Thu, May 9, 2013 at 12:51 PM, kishore g <g.kishore@gmail.com> wrote:

> Thanks Edward,
>
> I looked at the code and it looks like its nicely abstracted. I see some
> comments in the code that say this happens only in YARN. Can you give me
> some additional info on what is the difference when running with YARN.
>
> Another thing I wanted to check is what happens when a node fails, is the
> entire job restarted or just super step or just the sub task of the super
> step. I am interested in the current behavior and what would be nice to
> have.
>
> Is there a document that describes the internal architecture.
>
> Thanks,
> Kishore G
>
>
>
>
> On Wed, May 8, 2013 at 6:21 PM, Edward J. Yoon <edwardyoon@apache.org
> >wrote:
>
> > Hi,
> >
> > This would be great collaboration. Since we pursue the pluggable
> > interfaces for managing the synchronization[1], messenger, and job
> > scheduling systems (we want to preserve the classic (standalone)
> > cluster mode, while integrating with resource manager systems), the
> > integration with Helix won't be difficult.
> >
> > 1. http://wiki.apache.org/hama/SyncService
> >
> > On Thu, May 9, 2013 at 7:01 AM, kishore g <g.kishore@gmail.com> wrote:
> > > Hello,
> > >
> > > I am starting a discussion thread on potential pros/cons of using Helix
> > in
> > > Hama. I dont know the internal details of Hama, so please correct me if
> > > something does not make sense.
> > >
> > > My source of information is
> http://wiki.apache.org/hama/Architectureand a
> > > brief chat with Suraj at ApacheCon where he described the need for
> > barriers
> > > between super steps.
> > >
> > > Please read about Apache Helix here http://helix.incubator.apache.org/
> .
> > >
> > > Architecture wise Helix maps pretty well with the components in Hama.
> > > HelixController can be wrapped inside BSPMaster and GroomServer is the
> > > PARTICIPANT in Helix terminology that wraps Helix Agent.
> > >
> > > The partitioning and assigning tasks to GroomServers can be done via
> > Helix
> > > Apis, it basically boils down to setting the idealstate for a
> particular
> > > stage. Starting of the next step which basically depends on all tasks
> in
> > > previous step being completed can be done by watching the ExternalView.
> > >
> > > In the architecture wiki, I see that there is plan to integrate with
> > > Zookeeper for fault tolerance. Helix internally uses Zookeeper to store
> > the
> > > cluster state. So it might make it easier to make the tasks fault
> > tolerant
> > > and probably restartable as well at a task level instead of job/stage
> > level.
> > >
> > > We recently added a recipe in Helix to demonstrate the concept of
> > > dependency between resources.
> > >
> > > http://helix.incubator.apache.org/recipes/task_dag_execution.html
> > > Code:
> > >
> >
> https://github.com/apache/incubator-helix/tree/master/recipes/task-execution/src/main/java/org/apache/helix/taskexecution
> > >
> > > Let me know your thoughts.
> > >
> > > thanks,
> > > Kishore G
> >
> >
> >
> > --
> > Best Regards, Edward J. Yoon
> > @eddieyoon
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message