hama-dev mailing list archives

From "Edward J. Yoon (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HAMA-387) Advanced Barrier Synchronization
Date Mon, 19 Sep 2011 00:44:09 GMT

    [ https://issues.apache.org/jira/browse/HAMA-387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13107570#comment-13107570 ]

Edward J. Yoon commented on HAMA-387:
-------------------------------------

The job hangs again when I test the patch.

{code}
root@Cnode1:/usr/local/src/hama-trunk# core/bin/hama jar examples/target/hama-exampleSNAPSHOT.jar bench 160 10000 64
11/09/19 09:34:31 DEBUG bsp.BSPJobClient: BSPJobClient.submitJobDir: hdfs://hnode15:9/bsp/system/submit_z5c7vt
11/09/19 09:34:31 INFO bsp.BSPJobClient: Running job: job_201109190912_0005
11/09/19 09:34:34 INFO bsp.BSPJobClient: Current supersteps number: 0
11/09/19 09:34:40 INFO bsp.BSPJobClient: Current supersteps number: 1
11/09/19 09:34:43 INFO bsp.BSPJobClient: Current supersteps number: 3
11/09/19 09:34:46 INFO bsp.BSPJobClient: Current supersteps number: 5
11/09/19 09:34:49 INFO bsp.BSPJobClient: Current supersteps number: 6
11/09/19 09:34:52 INFO bsp.BSPJobClient: Current supersteps number: 8
11/09/19 09:34:55 INFO bsp.BSPJobClient: Current supersteps number: 10
11/09/19 09:34:58 INFO bsp.BSPJobClient: Current supersteps number: 12
11/09/19 09:35:01 INFO bsp.BSPJobClient: Current supersteps number: 13
11/09/19 09:35:04 INFO bsp.BSPJobClient: Current supersteps number: 14

----

2011-09-19 09:35:07,480 INFO org.apache.hama.bsp.TaskRunner: attempt_201109190912_0005_000005_0
11/09/19 09:35:07 INFO bsp.BSPPeer: xxxx enterBarrier() list.size():45 children in the list:[attempt_201109190912_0005_000020_0,
attempt_201109190912_0005_000005_0, attempt_201109190912_0005_000030_0, attempt_201109190912_0005_000021_0,
attempt_201109190912_0005_000023_0, attempt_201109190912_0005_000004_0, attempt_201109190912_0005_000010_0,
attempt_201109190912_0005_000014_0, attempt_201109190912_0005_000015_0, attempt_201109190912_0005_000039_0,
attempt_201109190912_0005_000006_0, attempt_201109190912_0005_000007_0, attempt_201109190912_0005_000019_0,
attempt_201109190912_0005_000044_0, attempt_201109190912_0005_000024_0, attempt_201109190912_0005_000013_0,
attempt_201109190912_0005_000025_0, attempt_201109190912_0005_000016_0, attempt_201109190912_0005_000034_0,
attempt_201109190912_0005_000042_0, attempt_201109190912_0005_000026_0, attempt_201109190912_0005_000035_0,
attempt_201109190912_0005_000008_0, attempt_201109190912_0005_000018_0, attempt_201109190912_0005_000033_0,
attempt_201109190912_0005_000009_0, attempt_201109190912_0005_000002_0, attempt_201109190912_0005_000041_0,
attempt_201109190912_0005_000036_0, attempt_201109190912_0005_000012_0, attempt_201109190912_0005_000003_0,
attempt_201109190912_0005_000011_0, attempt_201109190912_0005_000038_0, attempt_201109190912_0005_000029_0,
attempt_201109190912_0005_000028_0, attempt_201109190912_0005_000040_0, attempt_201109190912_0005_000017_0,
attempt_201109190912_0005_000043_0, attempt_201109190912_0005_000027_0, attempt_201109190912_0005_000000_0,
attempt_201109190912_0005_000001_0, attempt_201109190912_0005_000031_0, attempt_201109190912_0005_000037_0,
attempt_201109190912_0005_000022_0, attempt_201109190912_0005_000032_0]
2011-09-19 09:35:07,480 INFO org.apache.hama.bsp.TaskRunner: attempt_201109190912_0005_000005_0
11/09/19 09:35:07 INFO bsp.BSPPeer: =====> jobid:job_201109190912_0005 taskid:attempt_201109190912_0005_000005_0
after enterBarrier()
2011-09-19 09:35:07,480 INFO org.apache.hama.bsp.TaskRunner: attempt_201109190912_0005_000003_0
11/09/19 09:35:07 INFO bsp.BSPPeer: =====> jobid:job_201109190912_0005 taskid:attempt_201109190912_0005_000003_0
after enterBarrier()
2011-09-19 09:35:07,480 INFO org.apache.hama.bsp.TaskRunner: attempt_201109190912_0005_000005_0
11/09/19 09:35:07 INFO bsp.BSPPeer: =====> jobid:job_201109190912_0005 taskid:attempt_201109190912_0005_000005_0
before leaveBarrier()
2011-09-19 09:35:07,480 INFO org.apache.hama.bsp.TaskRunner: attempt_201109190912_0005_000005_0
11/09/19 09:35:07 INFO bsp.BSPPeer: xxxxx leaveBarrier() list.size:11 children in the list[attempt_201109190912_0005_000007_0,
attempt_201109190912_0005_000044_0, attempt_201109190912_0005_000018_0, attempt_201109190912_0005_000009_0,
attempt_201109190912_0005_000041_0, attempt_201109190912_0005_000003_0, attempt_201109190912_0005_000011_0,
attempt_201109190912_0005_000028_0, attempt_201109190912_0005_000027_0, attempt_201109190912_0005_000000_0,
attempt_201109190912_0005_000001_0]
2011-09-19 09:35:07,480 INFO org.apache.hama.bsp.TaskRunner: attempt_201109190912_0005_000001_0
11/09/19 09:35:07 INFO bsp.BSPPeer: xxxx enterBarrier() list.size():11 children in the list:[attempt_201109190912_0005_000007_0,
attempt_201109190912_0005_000044_0, attempt_201109190912_0005_000018_0, attempt_201109190912_0005_000009_0,
attempt_201109190912_0005_000041_0, attempt_201109190912_0005_000003_0, attempt_201109190912_0005_000011_0,
attempt_201109190912_0005_000028_0, attempt_201109190912_0005_000027_0, attempt_201109190912_0005_000000_0,
attempt_201109190912_0005_000001_0]
2011-09-19 09:35:07,617 INFO org.apache.hama.bsp.TaskRunner: attempt_201109190912_0005_000003_0
11/09/19 09:35:07 INFO bsp.BSPPeer: =====> jobid:job_201109190912_0005 taskid:attempt_201109190912_0005_000003_0
before leaveBarrier()
2011-09-19 09:35:07,661 INFO org.apache.hama.bsp.TaskRunner: attempt_201109190912_0005_000003_0
11/09/19 09:35:07 INFO bsp.BSPPeer: xxxxx leaveBarrier() list.size:3 children in the list[attempt_201109190912_0005_000028_0,
attempt_201109190912_0005_000027_0, attempt_201109190912_0005_000001_0]
2011-09-19 09:35:07,661 INFO org.apache.hama.bsp.TaskRunner: attempt_201109190912_0005_000001_0
11/09/19 09:35:07 INFO bsp.BSPPeer: xxxx enterBarrier() list.size():3 children in the list:[attempt_201109190912_0005_000028_0,
attempt_201109190912_0005_000027_0, attempt_201109190912_0005_000001_0]
2011-09-19 09:35:07,661 INFO org.apache.hama.bsp.TaskRunner: attempt_201109190912_0005_000005_0
11/09/19 09:35:07 INFO bsp.BSPPeer: xxxxx leaveBarrier() list.size:3 children in the list[attempt_201109190912_0005_000028_0,
attempt_201109190912_0005_000027_0, attempt_201109190912_0005_000001_0]
2011-09-19 09:35:07,836 INFO org.apache.hama.bsp.TaskRunner: attempt_201109190912_0005_000003_0
11/09/19 09:35:07 INFO bsp.BSPPeer: xxxxx leaveBarrier() list.size:1 children in the list[attempt_201109190912_0005_000001_0]
2011-09-19 09:35:07,836 INFO org.apache.hama.bsp.TaskRunner: attempt_201109190912_0005_000001_0
11/09/19 09:35:07 INFO bsp.BSPPeer: xxxx enterBarrier() list.size():1 children in the list:[attempt_201109190912_0005_000001_0]
2011-09-19 09:35:07,836 INFO org.apache.hama.bsp.TaskRunner: attempt_201109190912_0005_000005_0
11/09/19 09:35:07 INFO bsp.BSPPeer: xxxxx leaveBarrier() list.size:1 children in the list[attempt_201109190912_0005_000001_0]
{code}

> Advanced Barrier Synchronization
> --------------------------------
>
>                 Key: HAMA-387
>                 URL: https://issues.apache.org/jira/browse/HAMA-387
>             Project: Hama
>          Issue Type: Improvement
>          Components: bsp
>    Affects Versions: 0.3.0
>            Reporter: Edward J. Yoon
>            Assignee: Edward J. Yoon
>             Fix For: 0.4.0
>
>         Attachments: HAMA-387_v02.patch, HAMA-387_v03.patch, HAMA-387_v04.patch, new.patch, sleepless.patch, x.PNG, x.patch
>
>
> I think the lock file must include the following, so that ownership and validity can be checked:
>  * the job ID
>  * the task ID of the lock file owner
>  * the current superstep count
> Currently lock files are named by hostname, but in the future multiple tasks may run on a single groomserver.
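
The naming scheme proposed above could look roughly like the following. This is only a sketch under assumptions, not code from any of the attached patches: the class name `BarrierLockName`, the `#` separator, and the methods `encode`, `parse`, `isOwnedBy`, and `isValidFor` are all hypothetical. The IDs in `main` are taken from the log in this comment.

```java
import java.util.Objects;

/** Hypothetical helper: a lock name that carries job ID, owner task ID, and superstep. */
final class BarrierLockName {
    // Assumed separator; it must not occur inside job or task IDs.
    private static final String SEP = "#";

    final String jobId;
    final String taskId;
    final long superstep;

    BarrierLockName(String jobId, String taskId, long superstep) {
        this.jobId = Objects.requireNonNull(jobId, "jobId");
        this.taskId = Objects.requireNonNull(taskId, "taskId");
        if (superstep < 0) {
            throw new IllegalArgumentException("superstep must be >= 0: " + superstep);
        }
        this.superstep = superstep;
    }

    /** Builds the lock name, e.g. "job#task#14". */
    String encode() {
        return jobId + SEP + taskId + SEP + superstep;
    }

    /** Parses a name produced by encode(); rejects anything without exactly three parts. */
    static BarrierLockName parse(String name) {
        String[] parts = name.split(SEP, -1);
        if (parts.length != 3) {
            throw new IllegalArgumentException("malformed lock name: " + name);
        }
        return new BarrierLockName(parts[0], parts[1], Long.parseLong(parts[2]));
    }

    /** Ownership check: does this lock belong to the given task? */
    boolean isOwnedBy(String candidateTaskId) {
        return taskId.equals(candidateTaskId);
    }

    /** Validity check: is this lock from the given job and the given superstep? */
    boolean isValidFor(String currentJobId, long currentSuperstep) {
        return jobId.equals(currentJobId) && superstep == currentSuperstep;
    }

    public static void main(String[] args) {
        BarrierLockName lock = new BarrierLockName(
            "job_201109190912_0005", "attempt_201109190912_0005_000005_0", 14);
        BarrierLockName parsed = parse(lock.encode());
        System.out.println(lock.encode());
        System.out.println(parsed.isOwnedBy("attempt_201109190912_0005_000005_0"));
        System.out.println(parsed.isValidFor("job_201109190912_0005", 15));
    }
}
```

Because the task ID is part of the name, two tasks on the same groomserver would get distinct lock names, and a lock left over from an earlier superstep would fail the validity check rather than be counted in the current barrier.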

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

       
