incubator-hama-dev mailing list archives

From Thomas Jungblut <>
Subject Re: Hama and Data Locality
Date Mon, 02 Apr 2012 18:00:43 GMT
Data locality matters even more in Hama than in Hadoop, because we need the
network bandwidth to exchange messages, not to stream the input.
You could argue that you cannot read from the input by design, but making
the computation phase longer is not our intent.

Suraj is also right: we have to use the network topology to make smart
scheduling decisions.
But this is not trivial, because you can't predict which tasks will
communicate with each other.
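
To make the trade-off concrete, here is a rough back-of-the-envelope sketch of
the 1000-superstep argument from Suraj's mail quoted below. The cost model and
all timings are made-up assumptions for illustration, not measurements from a
Hama cluster, and the function names are hypothetical.

```python
def total_runtime(input_read_s, supersteps, compute_s, message_exchange_s):
    """One input read up front, then compute plus messaging in every superstep."""
    return input_read_s + supersteps * (compute_s + message_exchange_s)


def break_even_supersteps(extra_read_s, message_saving_s):
    """Supersteps after which cheaper messaging has repaid a slower input read."""
    return extra_read_s / message_saving_s


SUPERSTEPS = 1000  # the figure used in the quoted example

# Placement A (assumed numbers): every task is data-local, so the input read
# is fast, but peers sit on different racks and each exchange is slower.
data_local = total_runtime(input_read_s=10.0, supersteps=SUPERSTEPS,
                           compute_s=1.0, message_exchange_s=0.5)

# Placement B (assumed numbers): tasks share a rack, so the input read is
# slower (remote blocks), but each message exchange is cheaper.
rack_local = total_runtime(input_read_s=30.0, supersteps=SUPERSTEPS,
                           compute_s=1.0, message_exchange_s=0.25)

# 20 s of extra read time, repaid at 0.25 s per superstep.
break_even = break_even_supersteps(extra_read_s=30.0 - 10.0,
                                   message_saving_s=0.5 - 0.25)

print(f"data-local: {data_local:.0f}s, rack-local: {rack_local:.0f}s, "
      f"break-even after {break_even:.0f} supersteps")
```

Under these assumed numbers the one-time saving on the input read is repaid
after a few dozen supersteps, which is why long-running BSP jobs would favour
topology-aware placement. A job with only a handful of supersteps could still
favour data locality.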

On 2 April 2012 18:39, Suraj Menon <> wrote:

> Hi Praveen,
> I did not run any experiments with multiple racks to support my claim
> although I intend to.
> But it seems logical that, in pursuit of data locality, there is a good
> chance that a few of our tasks get scheduled on different racks.
> This helps reduce the time to read all inputs from the input file, but
> Hama performance also depends on how fast the BSP peer nodes can
> transfer all the messages to each other in subsequent supersteps.
> Unlike a mapper task, which completes after reading all its input records, a
> BSP task continues running on the same machine where it started. So, for
> example, if the task requires 1000 supersteps, we would get the first
> superstep to complete in the shortest time with data locality, but would delay
> the remaining 999 supersteps because of the increased delay in transferring
> messages. Hence the opinion that topological information should be given
> higher priority than data-locality needs. I know I should back
> every claim with data; I will do so soon with my VM setup. :)
> Also in the meantime, I came across this -
> We should keep this in mind for pure Hama clusters.
> -Suraj
> On Mon, Apr 2, 2012 at 12:17 PM, Praveen Sripati <> wrote:
> >
> > > While working on it, I realized that this won't necessarily improve the
> > > performance, because the resource requirements for Hama are different from
> > > those of Hadoop. This change would move the mapper tasks closer to the
> > > input, as in Hadoop. But in Hama, a task continues running on that machine
> > > throughout its lifetime. If, in search of data locality, the tasks get
> > > scheduled such that the communication between the nodes is costlier than
> > > normal (e.g. tasks resident in separate racks), then this change would
> > > degrade the performance.
> >
> > Doesn't data locality improve the performance of Hama?
> >
> > Praveen
> >

Thomas Jungblut
Berlin <>
