Subject: Re: giraph hanging after superstep
From: Jyotirmoy Sundi <sundi133@gmail.com>
To: user@giraph.apache.org
Date: Mon, 14 Oct 2013 12:16:53 -0700

The latest trunk compiled without needing to change any interfaces, apart from adding a new exception to one of the classes.

On Mon, Oct 14, 2013 at 11:40 AM, Jyotirmoy Sundi wrote:

> Thanks, will try that out; rewriting saveVertices to match the new
> interfaces does not seem too big.
> Did you find out later what the potential issue might have been?
>
> Thanks
> Sund
>
>
> On Mon, Oct 14, 2013 at 11:26 AM, Manuel Lagang wrote:
>
>> I also had the same issues when I used the out-of-core features, even for
>> trivial datasets, when I used the 1.0.0-RC3 branch. The job would seem to
>> finish all supersteps, but it would hang during the final output of data to
>> HDFS. I found that if I used the latest code in trunk instead (which
>> required some rewriting to match the new interface), then my jobs would
>> finish fine.
>>
>>
>> On Mon, Oct 14, 2013 at 11:13 AM, Jyotirmoy Sundi wrote:
>>
>>> Hi folks,
>>>           We are successfully able to run Giraph for 1B vertices and
>>> around 20B edges in our cluster. This is great. But when we run it over 5B
>>> vertices over the actual data and around 50B edges, we see some issues in
>>> the final step while offloading the partitions.
>>> Since the dataset is huge
>>> for our cluster, we are using giraph.useOutOfCoreGraph and giraph.useOutOfCoreMessages
>>> to spill the data when overloaded. With this setup all the supersteps
>>> finished within around 4 hours. But in the final step, after reporting
>>> "saving vertices" in the task status, it hangs after writing a few partitions;
>>> this is happening consistently in our case. I played with all the config
>>> params and nothing is helping; any suggestions from you would be really
>>> helpful. Thanks a lot.
>>>
>>> The log snippet:
>>>
>>> 2013-10-14 10:24:20,144 INFO org.apache.giraph.worker.BspServiceWorker: saveVertices: Starting to save 26146422 vertices
>>> 2013-10-14 10:24:20,183 INFO org.apache.giraph.partition.DiskBackedPartitionStore: offloadPartition: writing partition vertices 1922 to /mnt/diskg/mapred/local/taskTracker/sundi133/jobcache/job_201310130212_0013/attempt_201310130212_0013_m_000060_0/work/_bsp/_partitions/job_201310130212_0013/partition-1922_vertices
>>> 2013-10-14 10:24:20,307 WARN org.apache.giraph.bsp.BspService: process: Unknown and unprocessed event (path=/_hadoopBsp/job_201310130212_0013/_applicationAttemptsDir/0/_superstepDir/15/_addressesAndPartitions, type=NodeDeleted, state=SyncConnected)
>>> 2013-10-14 10:24:20,431 WARN org.apache.giraph.bsp.BspService: process: Unknown and unprocessed event (path=/_hadoopBsp/job_201310130212_0013/_applicationAttemptsDir/0/_superstepDir/15/_superstepFinished, type=NodeDeleted, state=SyncConnected)
>>> 2013-10-14 10:24:20,555 INFO org.apache.giraph.worker.BspServiceWorker: processEvent: Job state changed, checking to see if it needs to restart
>>> 2013-10-14 10:24:20,640 INFO org.apache.giraph.bsp.BspService: getJobState: Job state already exists (/_hadoopBsp/job_201310130212_0013/_masterJobState)
>>> 2013-10-14 10:24:22,928 INFO org.apache.giraph.partition.DiskBackedPartitionStore: offloadPartition: writing partition vertices 13762 to /mnt/diskg/mapred/local/taskTracker/sundi133/jobcache/job_201310130212_0013/attempt_201310130212_0013_m_000060_0/work/_bsp/_partitions/job_201310130212_0013/partition-13762_vertices
>>> 2013-10-14 10:24:27,648 INFO org.apache.giraph.partition.DiskBackedPartitionStore: offloadPartition: writing partition vertices 23682 to /mnt/diskg/mapred/local/taskTracker/sundi133/jobcache/job_201310130212_0013/attempt_201310130212_0013_m_000060_0/work/_bsp/_partitions/job_201310130212_0013/partition-23682_vertices
>>> 2013-10-14 10:24:30,557 INFO org.apache.giraph.partition.DiskBackedPartitionStore: offloadPartition: writing partition vertices 14882 to /mnt/diskg/mapred/local/taskTracker/sundi133/jobcache/job_201310130212_0013/attempt_201310130212_0013_m_000060_0/work/_bsp/_partitions/job_201310130212_0013/partition-14882_vertices
>>> 2013-10-14 10:24:32,935 INFO org.apache.giraph.partition.DiskBackedPartitionStore: offloadPartition: writing partition vertices 11842 to /mnt/diskg/mapred/local/taskTracker/sundi133/jobcache/job_201310130212_0013/attempt_201310130212_0013_m_000060_0/work/_bsp/_partitions/job_201310130212_0013/partition-11842_vertices
>>> 2013-10-14 10:24:33,714 INFO org.apache.giraph.partition.DiskBackedPartitionStore: offloadPartition: writing partition vertices 962 to /mnt/diskg/mapred/local/taskTracker/sundi133/jobcache/job_201310130212_0013/attempt_201310130212_0013_m_000060_0/work/_bsp/_partitions/job_201310130212_0013/partition-962_vertices
>>> 2013-10-14 10:24:35,184 INFO org.apache.giraph.worker.BspServiceWorker: saveVertices: Saved 978047 out of 26146422 vertices, on partition 5 out of 160
>>> 2013-10-14 10:24:35,187 INFO org.apache.giraph.partition.DiskBackedPartitionStore: offloadPartition: writing partition vertices 22722 to /mnt/diskg/mapred/local/taskTracker/sundi133/jobcache/job_201310130212_0013/attempt_201310130212_0013_m_000060_0/work/_bsp/_partitions/job_201310130212_0013/partition-22722_vertices
>>> 2013-10-14 10:24:37,276 INFO org.apache.giraph.partition.DiskBackedPartitionStore: offloadPartition: writing partition vertices 21762 to /mnt/diskg/mapred/local/taskTracker/sundi133/jobcache/job_201310130212_0013/attempt_201310130212_0013_m_000060_0/work/_bsp/_partitions/job_201310130212_0013/partition-21762_vertices
>>> 2013-10-14 10:24:39,868 INFO org.apache.giraph.partition.DiskBackedPartitionStore: offloadPartition: writing partition vertices 11362 to /mnt/diskg/mapred/local/taskTracker/sundi133/jobcache/job_201310130212_0013/attempt_201310130212_0013_m_000060_0/work/_bsp/_partitions/job_201310130212_0013/partition-11362_vertices
>>> 2013-10-14 10:24:41,391 INFO org.apache.giraph.partition.DiskBackedPartitionStore: offloadPartition: writing partition vertices 482 to /mnt/diskg/mapred/local/taskTracker/sundi133/jobcache/job_201310130212_0013/attempt_201310130212_0013_m_000060_0/work/_bsp/_partitions/job_201310130212_0013/partition-482_vertices
>>>
>>> ------------------------------
>>>
>>> The error shown in the job failure page for each attempt:
>>>
>>> FAILED
>>>
>>> Task attempt_201310130212_0013_m_000001_0 failed to report status for 7200 seconds. Killing!
>>>
>>> --
>>> Best Regards,
>>> Jyotirmoy Sundi
>>> Data Engineer,
>>> Admobius
>>>
>>> San Francisco, CA 94158
>>
>>
>
>
> --
> Best Regards,
> Jyotirmoy Sundi
> Data Engineer,
> Admobius
>
> San Francisco, CA 94158

--
Best Regards,
Jyotirmoy Sundi
Data Engineer,
Admobius

San Francisco, CA 94158
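[Editor's note] For readers reproducing this setup: the two out-of-core properties discussed in the thread (giraph.useOutOfCoreGraph and giraph.useOutOfCoreMessages) are ordinary Giraph configuration options passed to GiraphRunner via -ca. The sketch below is a hypothetical invocation, not the poster's actual command; the jar name, computation class, input/output formats, HDFS paths, and worker count are placeholders, and flag names follow the Giraph 1.0-era runner.

```shell
# Hypothetical GiraphRunner invocation; jar, computation class, and paths
# are placeholders. The -ca flags enable the out-of-core graph/message
# spilling described in this thread.
hadoop jar giraph-examples-with-dependencies.jar \
  org.apache.giraph.GiraphRunner \
  -D mapred.task.timeout=7200000 \
  com.example.MyComputation \
  -vif org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat \
  -vip /user/sundi133/input/graph \
  -of org.apache.giraph.io.formats.IdWithValueTextOutputFormat \
  -op /user/sundi133/output/graph \
  -w 160 \
  -ca giraph.useOutOfCoreGraph=true \
  -ca giraph.useOutOfCoreMessages=true
```

Note that mapred.task.timeout (milliseconds) is the Hadoop setting behind the "failed to report status for 7200 seconds. Killing!" message above: 7200000 ms is 7200 s. Raising it only postpones the kill; if the task is genuinely hung while saving vertices, as reported here, a larger timeout will not make the job finish.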