From: Pavan Kumar A <pavanka@outlook.com>
To: user@giraph.apache.org
Subject: RE: Graph partitioning and data locality
Date: Tue, 4 Nov 2014 20:58:16 +0530

You can also look at https://issues.apache.org/jira/browse/GIRAPH-908, which solves the case where you have a partition map and would like the graph to be partitioned that way after loading the input. It does not, however, solve the "do not shuffle data" part.


From: claudio.martella@gmail.com
Date: Tue, 4 Nov 2014 16:20:21 +0100
Subject: Re: Graph partitioning and data locality
To: user@giraph.apache.org

Hi,

answers are inline.

On Tue, Nov 4, 2014 at 8:36 AM, Martin Junghanns <martin.junghanns@gmx.net> wrote:
Hi group,

I have a question concerning the graph partitioning step. If I understood the code correctly, the graph is distributed to n partitions by using vertexID.hashCode() & n. I have two questions concerning that step.

1) Is the whole graph loaded and partitioned only by the Master? This would mean the whole data set has to be moved to that Master map job and then moved to the physical node that the worker for the partition runs on. As this sounds like a huge overhead, I further inspected the code:
I saw that there is also a WorkerGraphPartitioner, and I assume it calls the partitioning method on its local data (let's say its local HDFS blocks); if the resulting partition for a vertex is not its own, the data gets moved to that worker, which reduces the overhead. Is this assumption correct?

That is correct: workers forward vertex data to the worker that is responsible for that vertex, via hash partitioning by default, meaning that the master is not involved.

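The forwarding step described in this reply can be sketched in plain Java. This is a minimal illustration and not Giraph's implementation: the class `HashPartitionSketch`, its method names, and the round-robin partition-to-worker assignment are assumptions made for the example, and it uses a modulo over the partition count (the question writes the mapping as `vertexID.hashCode() & n`).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; none of these names come from Giraph.
final class HashPartitionSketch {

    /** Maps a vertex ID to a partition index in [0, partitionCount). */
    static int partitionFor(Object vertexId, int partitionCount) {
        // floorMod keeps the result non-negative even for negative hash codes.
        return Math.floorMod(vertexId.hashCode(), partitionCount);
    }

    /** Assumed round-robin assignment of partitions to workers. */
    static int workerFor(int partition, int workerCount) {
        return partition % workerCount;
    }

    /**
     * Returns the vertex IDs from a worker's local split that belong to
     * another worker and would therefore have to be forwarded.
     */
    static List<Long> toForward(List<Long> localSplit, int self,
                                int partitionCount, int workerCount) {
        List<Long> forwarded = new ArrayList<>();
        for (Long id : localSplit) {
            int owner = workerFor(partitionFor(id, partitionCount), workerCount);
            if (owner != self) {
                forwarded.add(id);
            }
        }
        return forwarded;
    }
}
```

With four partitions on two workers, worker 0 reading vertices 0 to 3 would keep 0 and 2 and forward 1 and 3. Each worker makes that decision on its own input, which is why no central node has to see the whole graph.
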
2) Let's say the graph is already partitioned in the file system, e.g. blocks on physical nodes contain logically connected graph nodes. Is it possible to just read the data as it is and skip the partitioning step? In that case I currently assume that the vertexID should contain the partitionID and that the custom partitioning would be an identity function (instead of hashing or range).

In principle you can. You would need to organize splits so that they contain all the data for each particular worker, and then assign relevant splits to the corresponding worker.

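One way to picture the pre-partitioned case is an ID layout that carries the partition ID, so that the partitioner only has to read it back. The sketch below is an illustration under stated assumptions, not a Giraph API: the 16/48-bit split of a `long` ID, the class `PrePartitionedIdSketch`, and all method names are hypothetical.

```java
// Hypothetical sketch; none of these names come from Giraph.
final class PrePartitionedIdSketch {

    // Assumed layout: upper 16 bits hold the partition ID,
    // lower 48 bits hold the vertex ID local to that partition.
    private static final int LOCAL_BITS = 48;
    private static final long LOCAL_MASK = (1L << LOCAL_BITS) - 1;
    private static final int MAX_PARTITION = (1 << 16) - 1;

    /** Packs a partition ID and a local ID into one vertex ID. */
    static long encode(int partitionId, long localId) {
        if (partitionId < 0 || partitionId > MAX_PARTITION) {
            throw new IllegalArgumentException("partitionId out of range: " + partitionId);
        }
        if (localId < 0 || localId > LOCAL_MASK) {
            throw new IllegalArgumentException("localId out of range: " + localId);
        }
        return ((long) partitionId << LOCAL_BITS) | localId;
    }

    /** The "identity" partitioning: read the partition ID back out of the vertex ID. */
    static int partitionOf(long vertexId) {
        return (int) (vertexId >>> LOCAL_BITS);
    }

    /** Recovers the local part of the vertex ID. */
    static long localIdOf(long vertexId) {
        return vertexId & LOCAL_MASK;
    }

    /** True if every vertex in a split already belongs to the given partition. */
    static boolean splitBelongsTo(long[] vertexIds, int partitionId) {
        for (long id : vertexIds) {
            if (partitionOf(id) != partitionId) {
                return false;
            }
        }
        return true;
    }
}
```

A split that passes `splitBelongsTo` for a partition would need no forwarding under this scheme, provided that split is also read by the worker that owns the partition, which is the split-assignment requirement the reply points out.
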
Thanks for your time and help!

Cheers,
Martin



--
Claudio Martella