incubator-hama-dev mailing list archives

From Praveen Sripati <praveensrip...@gmail.com>
Subject Re: Hama and Data Locality
Date Tue, 03 Apr 2012 01:30:38 GMT
> But this is not trivial, because you can't predict which task is
> communicating with other tasks.

One way is to schedule the tasks as close to each other as possible to
minimize network overhead.

On Mon, Apr 2, 2012 at 11:30 PM, Thomas Jungblut <
thomas.jungblut@googlemail.com> wrote:

> Data locality is needed even more than in Hadoop, because we need the
> bandwidth to exchange messages, not to stream the input.
> You could argue that you cannot read from the input by design, but making
> the computation phase longer is not our intent.
>
> Suraj is also right: we have to use the network topology to make smart
> decisions on scheduling.
> But this is not trivial, because you can't predict which task is
> communicating with other tasks.
>
> On 2 April 2012 at 18:39, Suraj Menon <surajsmenon@apache.org> wrote:
>
> > Hi Praveen,
> >
> > I did not run any experiments with multiple racks to support my claim,
> > although I intend to.
> > But it seems logical that, in pursuit of data locality, there is a good
> > chance that a few of our tasks get scheduled on different racks.
> > This helps reduce the time to read all inputs from the input file, but
> > Hama performance also depends on how fast the BSP peer nodes can
> > transfer all the messages to each other in subsequent supersteps.
> > Unlike a mapper task, which completes after reading all the input records,
> > the BSP task continues running on the same machine where it started. So,
> > for example, if the task requires 1000 supersteps, we would get the first
> > superstep to complete in the shortest time with data locality, but would
> > delay the remaining 999 supersteps because of the increased delay in
> > transferring messages. Hence the opinion that topological information
> > should be given higher priority than data-locality needs. I know I should
> > back every claim with data; I will do so soon with my VM setup. :)
> >
> > Also in the meantime, I came across this -
> > https://issues.apache.org/jira/browse/HDFS-385
> > We should keep this in mind for pure Hama clusters.
> >
> > -Suraj
> >
> > On Mon, Apr 2, 2012 at 12:17 PM, Praveen Sripati
> > <praveensripati@gmail.com> wrote:
> >
> > > > https://issues.apache.org/jira/browse/HAMA-543
> > >
> > > > While working on it, I realized that this won't necessarily improve
> > > the performance, because the resource requirements for Hama are
> > > different from Hadoop's. This change would move the mapper tasks closer
> > > to the input, as in Hadoop. But in the case of Hama, tasks continue
> > > running on that machine throughout their lifetime. If, in search of
> > > data locality, the tasks get scheduled such that the communication
> > > between the nodes is costlier than normal (e.g. tasks resident in
> > > separate racks), then this change would degrade the performance.
> > >
> > > Doesn't data locality improve the performance of Hama?
> > >
> > > Praveen
> > >
> >
>
>
>
> --
> Thomas Jungblut
> Berlin <thomas.jungblut@gmail.com>
>
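
The trade-off Suraj describes in the quoted message (one fast data-local
read versus many slower supersteps) can be put into a rough cost model.
The sketch below is only an illustration: the function names and all
timing figures are hypothetical assumptions, not measurements from Hama.
Only the 1000-superstep figure comes from the thread.

```python
# Rough cost model: input is read once, messages are exchanged every superstep.
# All timings are made-up illustrative values in milliseconds.

def total_runtime_ms(input_read_ms, exchange_ms_per_superstep, supersteps):
    """Total time = one-off input read + per-superstep message exchange."""
    return input_read_ms + exchange_ms_per_superstep * supersteps


def break_even_supersteps(data_local, rack_local):
    """Smallest superstep count at which the rack-local placement is strictly
    faster than the data-local one, or None if it never is.

    Each placement is a tuple: (input_read_ms, exchange_ms_per_superstep).
    """
    extra_read = rack_local[0] - data_local[0]      # extra one-off read cost
    saved_per_step = data_local[1] - rack_local[1]  # exchange time saved per superstep
    if saved_per_step <= 0:
        return None  # rack-local placement never catches up
    return max(0, extra_read // saved_per_step + 1)


# Hypothetical placements: (input read, message exchange per superstep).
DATA_LOCAL_CROSS_RACK = (10_000, 500)  # fast read, slow cross-rack messaging
SAME_RACK_REMOTE_INPUT = (30_000, 200)  # slow read, fast in-rack messaging

if __name__ == "__main__":
    for name, (read_ms, exchange_ms) in [
        ("data-local, cross-rack", DATA_LOCAL_CROSS_RACK),
        ("same-rack, remote input", SAME_RACK_REMOTE_INPUT),
    ]:
        print(name, total_runtime_ms(read_ms, exchange_ms, 1000), "ms")
    print("break-even at",
          break_even_supersteps(DATA_LOCAL_CROSS_RACK, SAME_RACK_REMOTE_INPUT),
          "supersteps")
```

Under these assumed numbers the data-local placement only wins for short
jobs. Real figures would have to come from experiments such as the
multi-rack runs Suraj mentions.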
