hama-dev mailing list archives

From kishore g <g.kish...@gmail.com>
Subject Re: [DISCUSS]Helix for global synchronization in Hama
Date Sat, 11 May 2013 18:38:03 GMT
+ helix-dev

Thanks, Suraj, for the info on the synchronization service. I will look
into BSPMasterSyncClient and BSPPeerSyncClient. I was not able to compile
the trunk code a couple of days back. I will give it another shot; I am
guessing HDFS is not a must to run the example.

Regarding your question about Helix: each job will be represented as a
resource in Helix. A job can have multiple sub-tasks, and each task can
have a state model associated with it, e.g. offline->online or start->load
data->process->finish. These tasks are assigned to different nodes simply
by setting the idealstate of the resource (job) in Helix. The assignment
algorithm is pluggable; this is similar to the
BestEffortDataLocalTaskAllocator strategy in Hama. Once the idealstate is
set, Helix makes sure it gets translated into the corresponding transitions
and monitors the tasks. A controller manages one cluster, and a cluster can
consist of any number of nodes and Helix resources (jobs in the case of
Hama). A controller automatically starts managing a job as soon as it is
added to the cluster. The controller issues the appropriate transition to
the right participant, and the participant performs the actual task as part
of this transition. If a participant dies, the controller can detect that
and automatically re-assign its tasks to the remaining nodes, or wait for
another container to start and assign the tasks to it.
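
To make the mapping concrete, here is a rough sketch of the two sides. The
Helix calls (ZKHelixAdmin, addResource, rebalance, the OnlineOffline state
model and the onBecome<To>From<From> convention) are the standard ones from
the Helix recipes; the class names JobSubmitter and HamaTaskStateModel are
made up for illustration, not a proposal for the actual module:

import org.apache.helix.HelixAdmin;
import org.apache.helix.NotificationContext;
import org.apache.helix.manager.zk.ZKHelixAdmin;
import org.apache.helix.model.Message;
import org.apache.helix.participant.statemachine.StateModel;

// BSPMaster side: model a submitted job as a Helix resource whose
// partitions are the job's tasks, then let the controller compute and
// apply the assignment.
public class JobSubmitter {
  public static void submitJob(String zkAddr, String cluster,
                               String jobId, int numTasks) {
    HelixAdmin admin = new ZKHelixAdmin(zkAddr);
    // use the built-in OFFLINE->ONLINE state model to keep the sketch small
    admin.addResource(cluster, jobId, numTasks, "OnlineOffline");
    // one replica per task; this computes and writes the job's idealstate.
    // A custom allocator (a la BestEffortDataLocalTaskAllocator) could
    // instead build an IdealState and write it via setResourceIdealState().
    admin.rebalance(cluster, jobId, 1);
  }
}

// GroomServer/container side: the transition callbacks the controller
// drives. Method names follow Helix's onBecome<To>From<From> convention.
class HamaTaskStateModel extends StateModel {
  public void onBecomeOnlineFromOffline(Message msg, NotificationContext ctx) {
    // msg.getPartitionName() identifies which task of the job to launch here
  }

  public void onBecomeOfflineFromOnline(Message msg, NotificationContext ctx) {
    // stop/clean up the task, e.g. when it gets re-assigned after a failure
  }
}

A GroomServer would register HamaTaskStateModel through a StateModelFactory
on a HelixManager connected as a PARTICIPANT; the controller then issues
the transitions described above.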

Helix can be run as a library or as a service, depending on the use case. I
still need to understand how Hama currently works and make sure that
integrating with Helix will be useful. After that, I can probably write a
prototype of a helix-hama module.

Is there an IRC channel for Hama?

Thanks
Kishore G

On Thu, May 9, 2013 at 8:48 PM, Suraj Menon <menonsuraj5@gmail.com> wrote:

> Hi Kishore,
>
> Nice to see your post here. The code refactoring of the synchronization
> service was done to both regulate and differentiate what can be done on
> Zookeeper (or any other synchronization service) by a peer (a container,
> each running one task of the job) and by the BSPMaster for the job. It is
> still biased towards the Zookeeper APIs, though :). This differentiation
> is made between BSPMasterSyncClient and BSPPeerSyncClient. Their purpose
> and use are the same in YARN and non-YARN modes.
>
> Fault tolerance is based on checkpointing the messages sent among peers in
> a job. At the end of every n (configurable) supersteps, all the messages
> received by a peer are checkpointed and saved to HDFS. The peers keep a
> record in Zookeeper of the last superstep for which they have checkpointed
> messages. On failure, the BSPMaster finds the smallest superstep number
> for which checkpoints are saved for all peers, and restarts the tasks from
> that superstep along with the new recovery task. Obviously this is not an
> industrial-strength solution; we still have to save the state of each peer
> that is essential for the computation. Also, since every message is
> written to HDFS, the performance takes a big hit. With the new spilling
> queue, the messages do get persisted to a file, and we will be looking
> into improving the performance from there. The current code is in the
> o.a.hama.ft package.
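>
> To make the recovery point concrete, below is a rough sketch of how the
> smallest common checkpointed superstep could be picked from Zookeeper. The
> znode layout (/bsp/<jobId>/checkpoint/<peer> holding the last checkpointed
> superstep) and the class name are illustrative only, not the actual
> o.a.hama.ft code:
>
> import java.nio.charset.StandardCharsets;
> import java.util.List;
> import org.apache.zookeeper.KeeperException;
> import org.apache.zookeeper.ZooKeeper;
>
> public class RecoverySuperstep {
>   public static long findRecoverySuperstep(ZooKeeper zk, String jobRoot)
>       throws KeeperException, InterruptedException {
>     // each peer writes the last superstep it has checkpointed to its znode
>     List<String> peers = zk.getChildren(jobRoot + "/checkpoint", false);
>     long recovery = Long.MAX_VALUE;
>     for (String peer : peers) {
>       byte[] data = zk.getData(jobRoot + "/checkpoint/" + peer, false, null);
>       long last = Long.parseLong(new String(data, StandardCharsets.UTF_8));
>       // the job can only roll back to a superstep every peer has saved
>       recovery = Math.min(recovery, last);
>     }
>     return peers.isEmpty() ? 0 : recovery;
>   }
> }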
>
> After we met, I did go through a bit of the Helix code. Maybe a stupid
> question, but I have to ask :): the different recipes suggest a model of
> handling state transitions in Helix. But when we are talking about fault
> tolerance within a job, does it mean we should have a separate instance of
> the Helix Controller per job? So for 2 jobs submitted, would we have two
> Helix controllers, each implementing a custom state transition model
> defined for fault tolerance? I need some more time to understand Helix.
>
> I don't see any cons for Hama in integrating with Helix, apart from the
> dependency on a new system. We can create a new helix-hama module and have
> a pluggable implementation backed by Helix in that module to begin with.
>
> Regards,
> Suraj
>
>
>
> On Thu, May 9, 2013 at 12:51 PM, kishore g <g.kishore@gmail.com> wrote:
>
> > Thanks Edward,
> >
> > I looked at the code, and it looks like it is nicely abstracted. I see
> > some comments in the code that say this happens only in YARN. Can you
> > give me some additional info on what the difference is when running with
> > YARN?
> >
> > Another thing I wanted to check: what happens when a node fails? Is the
> > entire job restarted, or just the superstep, or just the sub-task of the
> > superstep? I am interested in the current behavior and in what would be
> > nice to have.
> >
> > Is there a document that describes the internal architecture?
> >
> > Thanks,
> > Kishore G
> >
> >
> >
> >
> > On Wed, May 8, 2013 at 6:21 PM, Edward J. Yoon <edwardyoon@apache.org
> > >wrote:
> >
> > > Hi,
> > >
> > > This would be a great collaboration. Since we are pursuing pluggable
> > > interfaces for managing the synchronization[1], messenger, and job
> > > scheduling systems (we want to preserve the classic (standalone)
> > > cluster mode while integrating with resource manager systems), the
> > > integration with Helix won't be difficult.
> > >
> > > 1. http://wiki.apache.org/hama/SyncService
> > >
> > > On Thu, May 9, 2013 at 7:01 AM, kishore g <g.kishore@gmail.com> wrote:
> > > > Hello,
> > > >
> > > > I am starting a discussion thread on the potential pros/cons of using
> > > > Helix in Hama. I don't know the internal details of Hama, so please
> > > > correct me if something does not make sense.
> > > >
> > > > My source of information is http://wiki.apache.org/hama/Architecture
> > > > and a brief chat with Suraj at ApacheCon, where he described the need
> > > > for barriers between supersteps.
> > > >
> > > > Please read about Apache Helix here:
> > > > http://helix.incubator.apache.org/
> > > >
> > > > Architecture-wise, Helix maps pretty well onto the components in
> > > > Hama. The HelixController can be wrapped inside the BSPMaster, and
> > > > the GroomServer is the PARTICIPANT in Helix terminology that wraps
> > > > the Helix Agent.
> > > >
> > > > Partitioning and assigning tasks to GroomServers can be done via the
> > > > Helix APIs; it basically boils down to setting the idealstate for a
> > > > particular stage. Starting the next step, which depends on all tasks
> > > > in the previous step being completed, can be done by watching the
> > > > ExternalView.
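> > > >
> > > > As a rough sketch of that barrier (the class name SuperstepBarrier
> > > > and the use of ONLINE as the "task finished" state are assumptions
> > > > layered on the standard Helix listener API):
> > > >
> > > > import java.util.List;
> > > > import java.util.Map;
> > > > import org.apache.helix.ExternalViewChangeListener;
> > > > import org.apache.helix.NotificationContext;
> > > > import org.apache.helix.model.ExternalView;
> > > >
> > > > // Fires on every ExternalView change; the next superstep is released
> > > > // only when all tasks (partitions) of the job's resource are done.
> > > > public class SuperstepBarrier implements ExternalViewChangeListener {
> > > >   private final String jobResource;
> > > >   private final int numTasks;
> > > >
> > > >   public SuperstepBarrier(String jobResource, int numTasks) {
> > > >     this.jobResource = jobResource;
> > > >     this.numTasks = numTasks;
> > > >   }
> > > >
> > > >   @Override
> > > >   public void onExternalViewChange(List<ExternalView> views,
> > > >                                    NotificationContext context) {
> > > >     for (ExternalView view : views) {
> > > >       if (!jobResource.equals(view.getResourceName())) {
> > > >         continue;
> > > >       }
> > > >       int done = 0;
> > > >       for (String partition : view.getPartitionSet()) {
> > > >         Map<String, String> stateMap = view.getStateMap(partition);
> > > >         if (stateMap != null && stateMap.containsValue("ONLINE")) {
> > > >           done++;
> > > >         }
> > > >       }
> > > >       if (done == numTasks) {
> > > >         // all tasks of the previous step are done: release the barrier
> > > >       }
> > > >     }
> > > >   }
> > > > }
> > > >
> > > > The listener would be registered with
> > > > HelixManager#addExternalViewChangeListener by whichever component
> > > > owns the barrier (the BSPMaster, or the peers themselves).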
> > > >
> > > > In the architecture wiki, I see that there is a plan to integrate
> > > > with Zookeeper for fault tolerance. Helix internally uses Zookeeper
> > > > to store the cluster state, so it might make it easier to make the
> > > > tasks fault tolerant, and probably restartable at the task level
> > > > instead of the job/stage level.
> > > >
> > > > We recently added a recipe in Helix to demonstrate the concept of
> > > > dependencies between resources:
> > > >
> > > > http://helix.incubator.apache.org/recipes/task_dag_execution.html
> > > > Code:
> > > > https://github.com/apache/incubator-helix/tree/master/recipes/task-execution/src/main/java/org/apache/helix/taskexecution
> > > >
> > > > Let me know your thoughts.
> > > >
> > > > thanks,
> > > > Kishore G
> > >
> > >
> > >
> > > --
> > > Best Regards, Edward J. Yoon
> > > @eddieyoon
> > >
> >
>
