Date: Thu, 12 May 2011 22:39:47 +0000 (UTC)
From: "Thomas Jungblut (JIRA)"
To: hama-dev@incubator.apache.org
Reply-To: hama-dev@incubator.apache.org
Message-ID: <1653614906.8440.1305239987331.JavaMail.tomcat@hel.zones.apache.org>
In-Reply-To: <1094136074.652.1299492899376.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Issue Comment Edited] (HAMA-359) Development of Shortest Path Finding Algorithm

[
https://issues.apache.org/jira/browse/HAMA-359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13032694#comment-13032694 ]

Thomas Jungblut edited comment on HAMA-359 at 5/12/11 10:39 PM:
----------------------------------------------------------------

Now on 6 nodes: I have now run into the hang with the standard example as well. It is really random, though; it happens in about 1 of 5 runs. The same goes for bench 512 50 1000. It even happens during:

{noformat}
hama/bin/hama jar /usr/local/hama/hama-0.3.0-examples.jar test
11/05/12 23:48:12 INFO bsp.BSPJobClient: Running job: job_201105122340_0003
11/05/12 23:48:15 INFO bsp.BSPJobClient: Current supersteps number: 0
11/05/12 23:48:18 INFO bsp.BSPJobClient: Current supersteps number: 2
...
[Hangs forever]
{noformat}

Could this be a fault-tolerance problem? For example, one groom is no longer responsive and the others are just waiting for it to reach the barrier?

Another observation: in the long sequencefile SSSP run, some severe slowdowns occur between supersteps 15 and 18. tcpdump shows that there is actually communication going on, but it consists of a lot of small packets (length 12 to 20) and it takes a long time.
{noformat}
11/05/13 00:01:32 INFO bsp.BSPJobClient: Running job: job_201105122350_0004
11/05/13 00:01:35 INFO bsp.BSPJobClient: Current supersteps number: 0
11/05/13 00:02:44 INFO bsp.BSPJobClient: Current supersteps number: 9
11/05/13 00:02:53 INFO bsp.BSPJobClient: Current supersteps number: 10
11/05/13 00:02:56 INFO bsp.BSPJobClient: Current supersteps number: 12
11/05/13 00:03:50 INFO bsp.BSPJobClient: Current supersteps number: 13
11/05/13 00:04:17 INFO bsp.BSPJobClient: Current supersteps number: 15
11/05/13 00:14:36 INFO bsp.BSPJobClient: Current supersteps number: 16
11/05/13 00:19:30 INFO bsp.BSPJobClient: Current supersteps number: 18
{noformat}

Picked the communication between two of the six nodes:

{noformat}
00:20:42.160759 IP raynor.21810 > zeratul.37758: Flags [P.], seq 1:21, ack 12, win 46, options [nop,nop,TS val 501715 ecr 304403], length 20
00:20:42.160928 IP zeratul.37758 > raynor.21810: Flags [.], ack 21, win 501, options [nop,nop,TS val 304403 ecr 501715], length 0
00:20:44.169980 IP zeratul.37758 > raynor.21810: Flags [P.], seq 12:24, ack 21, win 501, options [nop,nop,TS val 304604 ecr 501715], length 12
00:20:44.170347 IP raynor.21810 > zeratul.37758: Flags [P.], seq 21:41, ack 24, win 46, options [nop,nop,TS val 501916 ecr 304604], length 20
00:20:44.170748 IP zeratul.37758 > raynor.21810: Flags [.], ack 41, win 501, options [nop,nop,TS val 304604 ecr 501916], length 0
00:20:46.170129 IP zeratul.37758 > raynor.21810: Flags [P.], seq 24:36, ack 41, win 501, options [nop,nop,TS val 304804 ecr 501916], length 12
00:20:46.170867 IP raynor.21810 > zeratul.37758: Flags [P.], seq 41:61, ack 36, win 46, options [nop,nop,TS val 502116 ecr 304804], length 20
00:20:46.171227 IP zeratul.37758 > raynor.21810: Flags [.], ack 61, win 501, options [nop,nop,TS val 304804 ecr 502116], length 0
00:20:48.170054 IP zeratul.37758 > raynor.21810: Flags [P.], seq 36:48, ack 61, win 501, options [nop,nop,TS val 305004 ecr 502116], length 12
00:20:48.170536 IP raynor.21810 > zeratul.37758: Flags [P.], seq 61:81, ack 48, win 46, options [nop,nop,TS val 502316 ecr 305004], length 20
00:20:48.170959 IP zeratul.37758 > raynor.21810: Flags [.], ack 81, win 501, options [nop,nop,TS val 305004 ecr 502316], length 0
{noformat}

Solution: The messages are in a queue, so I'll implement a list writable that batches these vertices together. I hope this will result in a far better running time...
We should take this experience into consideration for the other GSoC task.

> Development of Shortest Path Finding Algorithm
> ----------------------------------------------
>
>                 Key: HAMA-359
>                 URL: https://issues.apache.org/jira/browse/HAMA-359
>             Project: Hama
>          Issue Type: New Feature
>          Components: examples
>    Affects Versions: 0.2.0
>            Reporter: Edward J. Yoon
>            Assignee: Thomas Jungblut
>              Labels: gsoc, gsoc2011, mentor
>             Fix For: 0.3.0
>
>         Attachments: HAMA-359-v2.patch, HAMA-359-v3.patch, HAMA-359.patch, eddie.patch
>
>   Original Estimate: 2016h
>  Remaining Estimate: 2016h
>
> The goal of this project is the development of a parallel algorithm for finding a shortest path using Hama BSP.
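
The batching idea from the comment above (collecting the queued vertex messages into one list writable instead of sending each one separately) might look roughly like the sketch below. It is plain Java with no Hama or Hadoop dependency. `VertexBatch` and all of its methods are hypothetical names, and the (vertex id, distance) pair of ints is an assumed payload, not the format used in the patch. The `write`/`readFields` pair only mirrors the shape of Hadoop's `Writable` interface.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical batch message: many (vertexId, distance) updates in one payload.
final class VertexBatch {
  private final List<int[]> updates = new ArrayList<>();

  void add(int vertexId, int distance) {
    updates.add(new int[] {vertexId, distance});
  }

  int size() { return updates.size(); }
  int vertexId(int i) { return updates.get(i)[0]; }
  int distance(int i) { return updates.get(i)[1]; }

  // Same shape as Writable.write(DataOutput): entry count first, then entries.
  void write(DataOutput out) throws IOException {
    out.writeInt(updates.size());
    for (int[] u : updates) {
      out.writeInt(u[0]);
      out.writeInt(u[1]);
    }
  }

  // Same shape as Writable.readFields(DataInput): replaces the current content.
  void readFields(DataInput in) throws IOException {
    updates.clear();
    int n = in.readInt();
    for (int i = 0; i < n; i++) {
      int vertexId = in.readInt();
      int distance = in.readInt();
      add(vertexId, distance);
    }
  }

  // Serializes the whole batch into a single buffer (one send instead of many).
  byte[] toBytes() {
    try {
      ByteArrayOutputStream buf = new ByteArrayOutputStream();
      write(new DataOutputStream(buf));
      return buf.toByteArray();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  static VertexBatch fromBytes(byte[] bytes) {
    try {
      VertexBatch batch = new VertexBatch();
      batch.readFields(new DataInputStream(new ByteArrayInputStream(bytes)));
      return batch;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

Under these assumptions a batch of n updates serializes to 4 + 8n bytes in one buffer, so the per-packet overhead visible in the tcpdump output would be paid once per batch rather than once per vertex.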