giraph-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Eli Reisman (JIRA)" <>
Subject [jira] [Commented] (GIRAPH-275) Restore data locality to workers reading InputSplits where possible without querying NameNode, ZooKeeper
Date Mon, 06 Aug 2012 17:36:02 GMT


Eli Reisman commented on GIRAPH-275:

Were you thinking of also throttling the # of total splits so that there are never more workers
than splits? That would not be a bad idea. Usually in large runs (previously and with this
patch) I would include about 20% more workers than input splits (sometimes more) because this
way there were enough workers & partitions of vertices to go around that even the workers
who ended up also reading splits had a smaller share of the partitioned data to send/recv
over Netty before the calculation super steps could begin. But, we are looking at spill to
disk now to alleviate some of this, and while my focus is "Scale out with more workers," that
is not everyone's scale strategy.

If anyone has a problem with this idea that runs on small clusters and might want each worker
to read 2-3 splits then let me know, otherwise i can easily add this limit. Since we do have
complete choice of # of workers (and how much data we put in) I have been managing this with
a back of the envelope calc before a given job run, so maybe adding some comments to the code
and leaving users the freedom to manage this on their own is more flexible?

> Restore data locality to workers reading InputSplits where possible without querying
NameNode, ZooKeeper
> --------------------------------------------------------------------------------------------------------
>                 Key: GIRAPH-275
>                 URL:
>             Project: Giraph
>          Issue Type: Improvement
>          Components: bsp, graph
>    Affects Versions: 0.2.0
>            Reporter: Eli Reisman
>            Assignee: Eli Reisman
>             Fix For: 0.2.0
>         Attachments: GIRAPH-275-1.patch, GIRAPH-275-2.patch, GIRAPH-275-3.patch, GIRAPH-275-4.patch
> During INPUT_SUPERSTEP, workers wait on a barrier until the master has created a complete
list of available input splits. Once the barrier is past, each worker iterates through this
list of input splits, creating a znode to lay claim to the next unprocessed split the worker
> For a brief moment while the master is creating the input split znodes each worker iterates
through, it has access to InputSplit objects that also contain a list of hostnames on which
the blocks of the file are hosted. By including that list of locations in each znode pathname
we can allow each worker reading the list of available splits to sort it so that splits the
worker attempts to claim first are the ones that contain a block that is local to that worker's
> This allows the possibility for many workers to end up reading at least one split that
is local to its own host. If the input split selected holds a local block, the RecordReader
Hadoop supplies us with will automatically read from that block anyway. By supplying this
locality data as part of the znode name rather than info inside the znode, we avoid reading
the data from each znode while sorting, which is only currently done when a split is claimed
and which is IO intensive. Sorting the string path data is cheap and faster, and making the
final split znode's name longer doesn't seem to matter too much.
> By using the BspMaster's InputSplit data to include locality information in the znode
path directly, we also avoid having to access the FileSystem/BlockLocations directly from
either master or workers, which could also flood the name node with queries. This is the only
place I've found where some locality information is already available to Giraph free of additional
> Finally, by sorting each worker's split list this way, we get the contention-reduction
of GIRAPH-250 for free, since only workers on the same host will be likely to contend for
a split instead of the current situation in which all workers contend for the same input splits
from the same list, iterating from the same index. GIRAPH-250 has already been logged as reducing
pages of contention on the first pass (when using many 100's of workers) down to 0-3 contentions
before claiming a split to read.
> This passes 'mvn verify' etc. I will post results of cluster testing ASAP. If anyone
else could try this on an HDFS cluster where locality info is supplied to InputSplit objects,
I would be really interested to see other folks' results.

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:!default.jspa
For more information on JIRA, see:


View raw message