Subject: Re: giraph hanging after superstep
From: Jyotirmoy Sundi <sundi133@gmail.com>
To: user@giraph.apache.org
Date: Mon, 14 Oct 2013 11:40:53 -0700

Thanks, will try that out; rewriting saveVertices to match the new interfaces does not seem too big. Did you find out later what the potential issue might have been?

Thanks
Sundi

On Mon, Oct 14, 2013 at 11:26 AM, Manuel Lagang <manuellagang@gmail.com> wrote:

> I also had the same issues when I used the out-of-core features, even for
> trivial datasets, when I used the 1.0.0-RC3 branch. The job would seem to
> finish all supersteps, but it would hang during the final output of data to
> HDFS. I found that if I used the latest code in trunk instead (which
> required some rewriting to match the new interface), then my jobs would
> finish fine.
>
> On Mon, Oct 14, 2013 at 11:13 AM, Jyotirmoy Sundi <sundi133@gmail.com> wrote:
>
>> Hi folks,
>>         We are successfully able to run Giraph for 1B vertices and
>> around 20B edges in our cluster. This is great. But when we run it over 5B
>> vertices of the actual data and around 50B edges, we see some issues in
>> the final step while offloading the partitions. Since the dataset is huge
>> for our cluster, we are using giraph.useOutOfCoreGraph and
>> giraph.useOutOfCoreMessages to spill the data when overloaded. With this
>> setup all the supersteps finished within around 4 hours.
>> But in the final step, after reporting "saving vertices" in the task
>> status, it hangs after writing a few partitions; this is happening
>> consistently in our case. I played with all the config params and
>> nothing is helping out; any suggestions from you will be really
>> helpful. Thanks a lot.
>>
>> The log snippet:
>>
>> 2013-10-14 10:24:20,144 INFO org.apache.giraph.worker.BspServiceWorker: saveVertices: Starting to save 26146422 vertices
>> 2013-10-14 10:24:20,183 INFO org.apache.giraph.partition.DiskBackedPartitionStore: offloadPartition: writing partition vertices 1922 to /mnt/diskg/mapred/local/taskTracker/sundi133/jobcache/job_201310130212_0013/attempt_201310130212_0013_m_000060_0/work/_bsp/_partitions/job_201310130212_0013/partition-1922_vertices
>> 2013-10-14 10:24:20,307 WARN org.apache.giraph.bsp.BspService: process: Unknown and unprocessed event (path=/_hadoopBsp/job_201310130212_0013/_applicationAttemptsDir/0/_superstepDir/15/_addressesAndPartitions, type=NodeDeleted, state=SyncConnected)
>> 2013-10-14 10:24:20,431 WARN org.apache.giraph.bsp.BspService: process: Unknown and unprocessed event (path=/_hadoopBsp/job_201310130212_0013/_applicationAttemptsDir/0/_superstepDir/15/_superstepFinished, type=NodeDeleted, state=SyncConnected)
>> 2013-10-14 10:24:20,555 INFO org.apache.giraph.worker.BspServiceWorker: processEvent: Job state changed, checking to see if it needs to restart
>> 2013-10-14 10:24:20,640 INFO org.apache.giraph.bsp.BspService: getJobState: Job state already exists (/_hadoopBsp/job_201310130212_0013/_masterJobState)
>> 2013-10-14 10:24:22,928 INFO org.apache.giraph.partition.DiskBackedPartitionStore: offloadPartition: writing partition vertices 13762 to /mnt/diskg/mapred/local/taskTracker/sundi133/jobcache/job_201310130212_0013/attempt_201310130212_0013_m_000060_0/work/_bsp/_partitions/job_201310130212_0013/partition-13762_vertices
>> 2013-10-14 10:24:27,648 INFO org.apache.giraph.partition.DiskBackedPartitionStore: offloadPartition: writing partition vertices 23682 to /mnt/diskg/mapred/local/taskTracker/sundi133/jobcache/job_201310130212_0013/attempt_201310130212_0013_m_000060_0/work/_bsp/_partitions/job_201310130212_0013/partition-23682_vertices
>> 2013-10-14 10:24:30,557 INFO org.apache.giraph.partition.DiskBackedPartitionStore: offloadPartition: writing partition vertices 14882 to /mnt/diskg/mapred/local/taskTracker/sundi133/jobcache/job_201310130212_0013/attempt_201310130212_0013_m_000060_0/work/_bsp/_partitions/job_201310130212_0013/partition-14882_vertices
>> 2013-10-14 10:24:32,935 INFO org.apache.giraph.partition.DiskBackedPartitionStore: offloadPartition: writing partition vertices 11842 to /mnt/diskg/mapred/local/taskTracker/sundi133/jobcache/job_201310130212_0013/attempt_201310130212_0013_m_000060_0/work/_bsp/_partitions/job_201310130212_0013/partition-11842_vertices
>> 2013-10-14 10:24:33,714 INFO org.apache.giraph.partition.DiskBackedPartitionStore: offloadPartition: writing partition vertices 962 to /mnt/diskg/mapred/local/taskTracker/sundi133/jobcache/job_201310130212_0013/attempt_201310130212_0013_m_000060_0/work/_bsp/_partitions/job_201310130212_0013/partition-962_vertices
>> 2013-10-14 10:24:35,184 INFO org.apache.giraph.worker.BspServiceWorker: saveVertices: Saved 978047 out of 26146422 vertices, on partition 5 out of 160
>> 2013-10-14 10:24:35,187 INFO org.apache.giraph.partition.DiskBackedPartitionStore: offloadPartition: writing partition vertices 22722 to /mnt/diskg/mapred/local/taskTracker/sundi133/jobcache/job_201310130212_0013/attempt_201310130212_0013_m_000060_0/work/_bsp/_partitions/job_201310130212_0013/partition-22722_vertices
>> 2013-10-14 10:24:37,276 INFO org.apache.giraph.partition.DiskBackedPartitionStore: offloadPartition: writing partition vertices 21762 to /mnt/diskg/mapred/local/taskTracker/sundi133/jobcache/job_201310130212_0013/attempt_201310130212_0013_m_000060_0/work/_bsp/_partitions/job_201310130212_0013/partition-21762_vertices
>> 2013-10-14 10:24:39,868 INFO org.apache.giraph.partition.DiskBackedPartitionStore: offloadPartition: writing partition vertices 11362 to /mnt/diskg/mapred/local/taskTracker/sundi133/jobcache/job_201310130212_0013/attempt_201310130212_0013_m_000060_0/work/_bsp/_partitions/job_201310130212_0013/partition-11362_vertices
>> 2013-10-14 10:24:41,391 INFO org.apache.giraph.partition.DiskBackedPartitionStore: offloadPartition: writing partition vertices 482 to /mnt/diskg/mapred/local/taskTracker/sundi133/jobcache/job_201310130212_0013/attempt_201310130212_0013_m_000060_0/work/_bsp/_partitions/job_201310130212_0013/partition-482_vertices
>>
>> ------------------------------
>>
>> The error shown on the job failure page for each attempt:
>>
>> FAILED
>>
>> Task attempt_201310130212_0013_m_000001_0 failed to report status for 7200 seconds. Killing!
>>
>> --
>> Best Regards,
>> Jyotirmoy Sundi
>> Data Engineer,
>> Admobius
>>
>> San Francisco, CA 94158
>>

--
Best Regards,
Jyotirmoy Sundi
Data Engineer,
Admobius

San Francisco, CA 94158
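For reference, the out-of-core settings discussed in this thread are usually passed as custom arguments (`-ca`) when launching a job through GiraphRunner. A sketch of such an invocation is below; the jar name, computation class, input/output formats, and paths are placeholders, not taken from this thread, and the `mapred.task.timeout` value is only an illustration of raising the MR1 timeout behind the "failed to report status for 7200 seconds" kill:

```shell
# Hypothetical launch command: enable Giraph's out-of-core graph and
# message spilling, and raise the MapReduce task timeout (milliseconds)
# so a long-running saveVertices phase is not killed for inactivity.
hadoop jar giraph-with-dependencies.jar org.apache.giraph.GiraphRunner \
  -D mapred.task.timeout=14400000 \
  com.example.MyComputation \
  -vif org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat \
  -vip /input/graph \
  -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat \
  -op /output/graph \
  -w 160 \
  -ca giraph.useOutOfCoreGraph=true \
  -ca giraph.useOutOfCoreMessages=true
```

Note that raising the timeout only hides the symptom; as suggested above, the hang itself was resolved by moving to the newer DiskBackedPartitionStore code in trunk.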