incubator-hama-dev mailing list archives

From "Thomas Jungblut (JIRA)" <j...@apache.org>
Subject [jira] [Issue Comment Edited] (HAMA-359) Development of Shortest Path Finding Algorithm
Date Thu, 12 May 2011 22:23:47 GMT

    [ https://issues.apache.org/jira/browse/HAMA-359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13032694#comment-13032694 ]

Thomas Jungblut edited comment on HAMA-359 at 5/12/11 10:22 PM:
----------------------------------------------------------------

Now on 6 nodes:

This time I also ran into the hang with the standard example. It is really random: it happens in about 1 of 5 runs.
The same happens with bench 512 50 1000, and even during the test example:

{noformat}hama/bin/hama jar /usr/local/hama/hama-0.3.0-examples.jar test
11/05/12 23:48:12 INFO bsp.BSPJobClient: Running job: job_201105122340_0003
11/05/12 23:48:15 INFO bsp.BSPJobClient: Current supersteps number: 0
11/05/12 23:48:18 INFO bsp.BSPJobClient: Current supersteps number: 2
... [Hangs forever]
{noformat}

Could this be a fault-tolerance problem? For example, one groom stops responding and the
others just keep waiting for it to reach the barrier?
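
The suspected failure mode can be reproduced in miniature with plain Java threads. This is only a local analogy and assumes nothing about how Hama's grooms actually synchronize: java.util.concurrent.CyclicBarrier stands in for the superstep barrier, and the class and method names (BarrierHangDemo, countAbandonedWaits) are hypothetical. If one peer never arrives, a plain await() blocks forever; with await(timeout, unit) the first waiter gets a TimeoutException, the barrier breaks, and the remaining waiters fail fast with BrokenBarrierException instead of hanging.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BarrierHangDemo {

    /**
     * Simulates one superstep barrier shared by `parties` peers, of which
     * `stalledPeers` never arrive (an imitation of an unresponsive groom).
     * Returns how many of the arriving peers gave up because of the timeout.
     */
    public static int countAbandonedWaits(int parties, int stalledPeers, long timeoutMs) {
        CyclicBarrier barrier = new CyclicBarrier(parties);
        ExecutorService pool = Executors.newFixedThreadPool(parties);
        try {
            List<Future<Boolean>> results = new ArrayList<>();
            // Only the responsive peers are started; the stalled ones never call await().
            for (int i = 0; i < parties - stalledPeers; i++) {
                results.add(pool.submit(() -> {
                    try {
                        // A plain await() here would block forever if a peer never arrives.
                        barrier.await(timeoutMs, TimeUnit.MILLISECONDS);
                        return false;
                    } catch (TimeoutException | BrokenBarrierException e) {
                        // The first peer to time out breaks the barrier; the rest see it broken.
                        return true;
                    }
                }));
            }
            int abandoned = 0;
            for (Future<Boolean> f : results) {
                if (f.get()) {
                    abandoned++;
                }
            }
            return abandoned;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("all 6 peers arrive: abandoned = " + countAbandonedWaits(6, 0, 2000));
        System.out.println("1 of 6 peers stalls: abandoned = " + countAbandonedWaits(6, 1, 500));
    }
}
```

With all six peers arriving, nobody gives up; with one stalled peer, all five arriving peers abandon the barrier after the timeout rather than waiting indefinitely, which is the kind of behavior a timeout on the real barrier would give.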

Another observation: in the long SequenceFile SSSP run, severe slowdowns occur between supersteps
15 and 18.
tcpdump shows that there is communication going on, but it consists of many small packets
(length 12 to 20 bytes), and it takes a long time.
{noformat}
11/05/13 00:01:32 INFO bsp.BSPJobClient: Running job: job_201105122350_0004
11/05/13 00:01:35 INFO bsp.BSPJobClient: Current supersteps number: 0
11/05/13 00:02:44 INFO bsp.BSPJobClient: Current supersteps number: 9
11/05/13 00:02:53 INFO bsp.BSPJobClient: Current supersteps number: 10
11/05/13 00:02:56 INFO bsp.BSPJobClient: Current supersteps number: 12
11/05/13 00:03:50 INFO bsp.BSPJobClient: Current supersteps number: 13
11/05/13 00:04:17 INFO bsp.BSPJobClient: Current supersteps number: 15
11/05/13 00:14:36 INFO bsp.BSPJobClient: Current supersteps number: 16
11/05/13 00:19:30 INFO bsp.BSPJobClient: Current supersteps number: 18
{noformat}
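
To put numbers on the slowdown, the progress lines above can be turned into per-interval durations. The helper below is a hypothetical sketch (SuperstepTimes is not part of Hama) that parses the BSPJobClient lines and reports how long each reported jump took, assuming the run does not cross midnight. On this log it gives 619 s for 15->16 and 294 s for 16->18, while no earlier jump took more than 69 s.

```java
import java.time.Duration;
import java.time.LocalTime;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SuperstepTimes {
    // The job client output copied from the log above.
    public static final List<String> SAMPLE_LOG = List.of(
        "11/05/13 00:01:32 INFO bsp.BSPJobClient: Running job: job_201105122350_0004",
        "11/05/13 00:01:35 INFO bsp.BSPJobClient: Current supersteps number: 0",
        "11/05/13 00:02:44 INFO bsp.BSPJobClient: Current supersteps number: 9",
        "11/05/13 00:02:53 INFO bsp.BSPJobClient: Current supersteps number: 10",
        "11/05/13 00:02:56 INFO bsp.BSPJobClient: Current supersteps number: 12",
        "11/05/13 00:03:50 INFO bsp.BSPJobClient: Current supersteps number: 13",
        "11/05/13 00:04:17 INFO bsp.BSPJobClient: Current supersteps number: 15",
        "11/05/13 00:14:36 INFO bsp.BSPJobClient: Current supersteps number: 16",
        "11/05/13 00:19:30 INFO bsp.BSPJobClient: Current supersteps number: 18");

    // Captures the wall-clock time and the superstep number from a progress line.
    private static final Pattern PROGRESS = Pattern.compile(
        "^\\S+ (\\d{2}:\\d{2}:\\d{2}) INFO bsp\\.BSPJobClient: Current supersteps number: (\\d+)$");

    /** One "from->to: Ns" entry per consecutive pair of progress lines. */
    public static List<String> intervals(List<String> logLines) {
        List<String> out = new ArrayList<>();
        LocalTime prevTime = null;
        int prevStep = -1;
        for (String line : logLines) {
            Matcher m = PROGRESS.matcher(line);
            if (!m.matches()) {
                continue; // skip "Running job" and any unrelated lines
            }
            LocalTime t = LocalTime.parse(m.group(1));
            int step = Integer.parseInt(m.group(2));
            if (prevTime != null) {
                // Assumes both timestamps fall on the same day.
                long secs = Duration.between(prevTime, t).getSeconds();
                out.add(prevStep + "->" + step + ": " + secs + "s");
            }
            prevTime = t;
            prevStep = step;
        }
        return out;
    }

    public static void main(String[] args) {
        intervals(SAMPLE_LOG).forEach(System.out::println);
    }
}
```

Running it prints 0->9: 69s, 9->10: 9s, 10->12: 3s, 12->13: 54s, 13->15: 27s, 15->16: 619s and 16->18: 294s, one per line.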

Here is the traffic between two of the six nodes:

{noformat}
00:20:42.160759 IP raynor.21810 > zeratul.37758: Flags [P.], seq 1:21, ack 12, win 46, options [nop,nop,TS val 501715 ecr 304403], length 20
00:20:42.160928 IP zeratul.37758 > raynor.21810: Flags [.], ack 21, win 501, options [nop,nop,TS val 304403 ecr 501715], length 0
00:20:44.169980 IP zeratul.37758 > raynor.21810: Flags [P.], seq 12:24, ack 21, win 501, options [nop,nop,TS val 304604 ecr 501715], length 12
00:20:44.170347 IP raynor.21810 > zeratul.37758: Flags [P.], seq 21:41, ack 24, win 46, options [nop,nop,TS val 501916 ecr 304604], length 20
00:20:44.170748 IP zeratul.37758 > raynor.21810: Flags [.], ack 41, win 501, options [nop,nop,TS val 304604 ecr 501916], length 0
00:20:46.170129 IP zeratul.37758 > raynor.21810: Flags [P.], seq 24:36, ack 41, win 501, options [nop,nop,TS val 304804 ecr 501916], length 12
00:20:46.170867 IP raynor.21810 > zeratul.37758: Flags [P.], seq 41:61, ack 36, win 46, options [nop,nop,TS val 502116 ecr 304804], length 20
00:20:46.171227 IP zeratul.37758 > raynor.21810: Flags [.], ack 61, win 501, options [nop,nop,TS val 304804 ecr 502116], length 0
00:20:48.170054 IP zeratul.37758 > raynor.21810: Flags [P.], seq 36:48, ack 61, win 501, options [nop,nop,TS val 305004 ecr 502116], length 12
00:20:48.170536 IP raynor.21810 > zeratul.37758: Flags [P.], seq 61:81, ack 48, win 46, options [nop,nop,TS val 502316 ecr 305004], length 20
00:20:48.170959 IP zeratul.37758 > raynor.21810: Flags [.], ack 81, win 501, options [nop,nop,TS val 305004 ecr 502316], length 0
{noformat}
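
The capture shows very little payload moving: a 12-byte segment from zeratul, a 20-byte reply from raynor and a bare ACK, repeating roughly every two seconds. The hypothetical helper below (TcpdumpCadence, a sketch assuming one tcpdump record per line as tcpdump prints them, whereas the mail archive wraps each record) extracts the timestamps of non-empty segments in one direction and sums the payload lengths. On this excerpt the 12-byte segments are spaced 2,000,149 and 1,999,925 microseconds apart, and the whole six-second window carries only 116 payload bytes. Which service listens on raynor's port 21810 is not stated here, so reading this as keep-alive traffic rather than message transfer is an interpretation, not something the capture proves.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TcpdumpCadence {
    // The records from the capture above, one per line.
    public static final List<String> SAMPLE = List.of(
        "00:20:42.160759 IP raynor.21810 > zeratul.37758: Flags [P.], seq 1:21, ack 12, win 46, options [nop,nop,TS val 501715 ecr 304403], length 20",
        "00:20:42.160928 IP zeratul.37758 > raynor.21810: Flags [.], ack 21, win 501, options [nop,nop,TS val 304403 ecr 501715], length 0",
        "00:20:44.169980 IP zeratul.37758 > raynor.21810: Flags [P.], seq 12:24, ack 21, win 501, options [nop,nop,TS val 304604 ecr 501715], length 12",
        "00:20:44.170347 IP raynor.21810 > zeratul.37758: Flags [P.], seq 21:41, ack 24, win 46, options [nop,nop,TS val 501916 ecr 304604], length 20",
        "00:20:44.170748 IP zeratul.37758 > raynor.21810: Flags [.], ack 41, win 501, options [nop,nop,TS val 304604 ecr 501916], length 0",
        "00:20:46.170129 IP zeratul.37758 > raynor.21810: Flags [P.], seq 24:36, ack 41, win 501, options [nop,nop,TS val 304804 ecr 501916], length 12",
        "00:20:46.170867 IP raynor.21810 > zeratul.37758: Flags [P.], seq 41:61, ack 36, win 46, options [nop,nop,TS val 502116 ecr 304804], length 20",
        "00:20:46.171227 IP zeratul.37758 > raynor.21810: Flags [.], ack 61, win 501, options [nop,nop,TS val 304804 ecr 502116], length 0",
        "00:20:48.170054 IP zeratul.37758 > raynor.21810: Flags [P.], seq 36:48, ack 61, win 501, options [nop,nop,TS val 305004 ecr 502116], length 12",
        "00:20:48.170536 IP raynor.21810 > zeratul.37758: Flags [P.], seq 61:81, ack 48, win 46, options [nop,nop,TS val 502316 ecr 305004], length 20",
        "00:20:48.170959 IP zeratul.37758 > raynor.21810: Flags [.], ack 81, win 501, options [nop,nop,TS val 305004 ecr 502316], length 0");

    // "HH:MM:SS.micros IP src > dst: ... length N"
    private static final Pattern RECORD = Pattern.compile(
        "^(\\d{2}):(\\d{2}):(\\d{2})\\.(\\d{6}) IP (\\S+) > (\\S+): .*length (\\d+)$");

    /** Microseconds since midnight of every non-empty segment sent from src to dst. */
    public static List<Long> payloadTimes(List<String> records, String src, String dst) {
        List<Long> times = new ArrayList<>();
        for (String r : records) {
            Matcher m = RECORD.matcher(r);
            if (!m.matches()) {
                continue;
            }
            boolean sameFlow = m.group(5).equals(src) && m.group(6).equals(dst);
            if (!sameFlow || Long.parseLong(m.group(7)) == 0) {
                continue; // other direction, or a pure ACK without payload
            }
            long seconds = (Long.parseLong(m.group(1)) * 60 + Long.parseLong(m.group(2))) * 60
                    + Long.parseLong(m.group(3));
            times.add(seconds * 1_000_000L + Long.parseLong(m.group(4)));
        }
        return times;
    }

    /** Differences between consecutive timestamps, in microseconds. */
    public static List<Long> gaps(List<Long> times) {
        List<Long> out = new ArrayList<>();
        for (int i = 1; i < times.size(); i++) {
            out.add(times.get(i) - times.get(i - 1));
        }
        return out;
    }

    /** Sum of the "length N" payload sizes over all parsed records, both directions. */
    public static long totalPayloadBytes(List<String> records) {
        long total = 0;
        for (String r : records) {
            Matcher m = RECORD.matcher(r);
            if (m.matches()) {
                total += Long.parseLong(m.group(7));
            }
        }
        return total;
    }

    public static void main(String[] args) {
        List<Long> requests = payloadTimes(SAMPLE, "zeratul.37758", "raynor.21810");
        System.out.println("gaps between 12-byte segments (us): " + gaps(requests));
        System.out.println("payload bytes in excerpt: " + totalPayloadBytes(SAMPLE));
    }
}
```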



> Development of Shortest Path Finding Algorithm
> ----------------------------------------------
>
>                 Key: HAMA-359
>                 URL: https://issues.apache.org/jira/browse/HAMA-359
>             Project: Hama
>          Issue Type: New Feature
>          Components: examples
>    Affects Versions: 0.2.0
>            Reporter: Edward J. Yoon
>            Assignee: Thomas Jungblut
>              Labels: gsoc, gsoc2011, mentor
>             Fix For: 0.3.0
>
>         Attachments: HAMA-359-v2.patch, HAMA-359-v3.patch, HAMA-359.patch, eddie.patch
>
>   Original Estimate: 2016h
>  Remaining Estimate: 2016h
>
> The goal of this project is the development of a parallel algorithm for finding a shortest
> path using Hama BSP.
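
For readers unfamiliar with how a shortest-path computation maps onto supersteps, here is a minimal single-process sketch of a common BSP formulation: a Bellman-Ford style relaxation in which each superstep delivers the distance messages produced in the previous one, and the run ends when no messages are in flight. It does not use the Hama API and is not the algorithm in the attached patches; BspShortestPathSketch, run and the {from, to, weight} edge encoding are hypothetical, and edge weights are assumed non-negative.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BspShortestPathSketch {

    /** Final distances plus the number of supersteps executed. */
    public static final class Result {
        public final long[] dist;
        public final int supersteps;
        Result(long[] dist, int supersteps) { this.dist = dist; this.supersteps = supersteps; }
    }

    /**
     * @param n      number of vertices, numbered 0..n-1
     * @param edges  directed edges as {from, to, weight}; weights assumed non-negative
     * @param source start vertex
     */
    public static Result run(int n, int[][] edges, int source) {
        // Adjacency lists: outgoing edges of each vertex as {to, weight}.
        List<List<int[]>> out = new ArrayList<>();
        for (int i = 0; i < n; i++) out.add(new ArrayList<>());
        for (int[] e : edges) out.get(e[0]).add(new int[] {e[1], e[2]});

        long[] dist = new long[n];
        Arrays.fill(dist, Long.MAX_VALUE); // "infinity" until a message arrives
        // Messages delivered at the start of the next superstep: target vertex -> candidate distance.
        Map<Integer, Long> inbox = new HashMap<>();
        inbox.put(source, 0L);
        int supersteps = 0;

        while (!inbox.isEmpty()) { // no messages in flight: no vertex can improve any further
            Map<Integer, Long> outbox = new HashMap<>();
            for (Map.Entry<Integer, Long> msg : inbox.entrySet()) {
                int v = msg.getKey();
                long candidate = msg.getValue();
                if (candidate >= dist[v]) continue; // not an improvement: vertex stays silent
                dist[v] = candidate;
                for (int[] e : out.get(v)) {
                    // Combine messages for the same target by keeping only the minimum.
                    outbox.merge(e[0], candidate + e[1], Long::min);
                }
            }
            inbox = outbox; // stands in for the barrier: messages become visible next superstep
            supersteps++;
        }
        return new Result(dist, supersteps);
    }

    public static void main(String[] args) {
        int[][] edges = {{0, 1, 4}, {0, 2, 1}, {2, 1, 2}, {1, 3, 1}, {2, 3, 5}};
        Result r = run(4, edges, 0);
        System.out.println(Arrays.toString(r.dist) + " after " + r.supersteps + " supersteps");
    }
}
```

On the small graph in main this prints [0, 3, 1, 4] after 4 supersteps: vertex 1 is first reached at distance 4 and corrected to 3 one superstep later, which shows why the superstep count tracks how far improvements have to propagate rather than the number of vertices.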

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
