hama-dev mailing list archives

From "Edward J. Yoon (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (HAMA-973) GraphJob and RandBench example works incorrectly when FT is enabled.
Date Tue, 08 Sep 2015 01:56:45 GMT

     [ https://issues.apache.org/jira/browse/HAMA-973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Edward J. Yoon resolved HAMA-973.
---------------------------------
       Resolution: Fixed
    Fix Version/s: 0.7.1

> GraphJob and RandBench example works incorrectly when FT is enabled.
> --------------------------------------------------------------------
>
>                 Key: HAMA-973
>                 URL: https://issues.apache.org/jira/browse/HAMA-973
>             Project: Hama
>          Issue Type: Bug
>          Components: bsp core
>    Affects Versions: 0.7.0
>            Reporter: Edward J. Yoon
>            Assignee: Edward J. Yoon
>            Priority: Critical
>             Fix For: 0.7.1
>
>         Attachments: patch.txt
>
>
> Today I tested the fault tolerance (FT) feature with RandBench. FT itself works fine, but I found a bug in the RandBench program.
> {code}
> [root@cluster-0 hama-0.7.0]# bin/hama jar hama-examples-0.7.0.jar bench 100 100 100
> 15/09/03 12:59:57 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 15/09/03 12:59:58 INFO Configuration.deprecation: user.name is deprecated. Instead, use mapreduce.job.user.name
> 15/09/03 12:59:58 INFO bsp.BSPJobClient: Running job: job_201509031258_0002
> 15/09/03 13:00:01 INFO bsp.BSPJobClient: Current supersteps number: 0
> 15/09/03 13:00:22 INFO bsp.BSPJobClient: Current supersteps number: 2
> 15/09/03 13:00:26 INFO bsp.BSPJobClient: Current supersteps number: 5
> 15/09/03 13:00:29 INFO bsp.BSPJobClient: Current supersteps number: 11
> 15/09/03 13:00:32 INFO bsp.BSPJobClient: Current supersteps number: 16
> 15/09/03 13:00:35 INFO bsp.BSPJobClient: Current supersteps number: 21
> 15/09/03 13:00:38 INFO bsp.BSPJobClient: Current supersteps number: 28
> 15/09/03 13:00:41 INFO bsp.BSPJobClient: Current supersteps number: 35
> 15/09/03 13:00:44 INFO bsp.BSPJobClient: Current supersteps number: 42
> 15/09/03 13:00:47 INFO bsp.BSPJobClient: Current supersteps number: 49
> 15/09/03 13:00:50 INFO bsp.BSPJobClient: Current supersteps number: 56
> 15/09/03 13:02:05 INFO bsp.BSPJobClient: Current supersteps number: 0
> 15/09/03 13:02:08 INFO bsp.BSPJobClient: Current supersteps number: 56
> 15/09/03 13:02:11 INFO bsp.BSPJobClient: Current supersteps number: 0
> 15/09/03 13:02:20 INFO bsp.BSPJobClient: Current supersteps number: 57
> 15/09/03 13:02:23 INFO bsp.BSPJobClient: Current supersteps number: 61
> 15/09/03 13:02:26 INFO bsp.BSPJobClient: Current supersteps number: 67
> 15/09/03 13:02:29 INFO bsp.BSPJobClient: Current supersteps number: 72
> 15/09/03 13:02:32 INFO bsp.BSPJobClient: Current supersteps number: 77
> 15/09/03 13:02:35 INFO bsp.BSPJobClient: Current supersteps number: 84
> 15/09/03 13:02:38 INFO bsp.BSPJobClient: Current supersteps number: 91
> 15/09/03 13:02:41 INFO bsp.BSPJobClient: Current supersteps number: 97
> 15/09/03 13:02:44 INFO bsp.BSPJobClient: Current supersteps number: 106
> 15/09/03 13:02:47 INFO bsp.BSPJobClient: Current supersteps number: 113
> 15/09/03 13:02:50 INFO bsp.BSPJobClient: Current supersteps number: 125
> 15/09/03 13:02:53 INFO bsp.BSPJobClient: Current supersteps number: 134
> 15/09/03 13:02:56 INFO bsp.BSPJobClient: Current supersteps number: 144
> 15/09/03 13:02:59 INFO bsp.BSPJobClient: Current supersteps number: 152
> 15/09/03 13:03:02 INFO bsp.BSPJobClient: Current supersteps number: 156
> 15/09/03 13:03:05 INFO bsp.BSPJobClient: The total number of supersteps: 156
> 15/09/03 13:03:05 INFO bsp.BSPJobClient: Counters: 6
> 15/09/03 13:03:05 INFO bsp.BSPJobClient:   org.apache.hama.bsp.JobInProgress$JobCounter
> 15/09/03 13:03:05 INFO bsp.BSPJobClient:     SUPERSTEPS=156
> 15/09/03 13:03:05 INFO bsp.BSPJobClient:     LAUNCHED_TASKS=160
> 15/09/03 13:03:05 INFO bsp.BSPJobClient:   org.apache.hama.bsp.BSPPeerImpl$PeerCounter
> 15/09/03 13:03:05 INFO bsp.BSPJobClient:     SUPERSTEP_SUM=24960
> 15/09/03 13:03:05 INFO bsp.BSPJobClient:     TIME_IN_SYNC_MS=1943366
> 15/09/03 13:03:05 INFO bsp.BSPJobClient:     TOTAL_MESSAGES_SENT=1600000
> 15/09/03 13:03:05 INFO bsp.BSPJobClient:     TOTAL_MESSAGES_RECEIVED=1600000
> Job Finished in 187.453 seconds
> {code}
> I ran the job with the maximum number of iterations set to 100. At superstep 56, I killed one task manually and confirmed that the failed task was recovered automatically. However, the total number of supersteps was 156, not 100.
> The reason is simple: the loop counter i always starts from 0, so a relaunched task runs all nSupersteps iterations again. To fix this, we have to initialize i to (int) peer.getSuperstepCount().
> {code}
>     public void bsp(
>         BSPPeer<NullWritable, NullWritable, NullWritable, NullWritable, BytesWritable> peer)
>         throws IOException, SyncException, InterruptedException {
>       byte[] dummyData = new byte[sizeOfMsg];
>       String[] peers = peer.getAllPeerNames();
>       for (int i = 0; i < nSupersteps; i++) {
> {code}
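The effect of the proposed change can be illustrated with a small standalone sketch. It does not use Hama itself: SuperstepSource is a hypothetical stand-in for the single BSPPeer method the fix relies on, and the class and method names are invented for illustration. It also assumes that a recovered task reports the superstep at which it was killed (56 in the run above), which may not match the real recovery behaviour exactly.

```java
public class ResumeLoopSketch {

    /** Hypothetical stand-in for the one BSPPeer method the fix relies on. */
    interface SuperstepSource {
        long getSuperstepCount();
    }

    /** Original behaviour: the loop always restarts from 0. */
    static int runFromZero(int nSupersteps) {
        int executed = 0;
        for (int i = 0; i < nSupersteps; i++) {
            executed++; // the real benchmark sends messages and syncs here
        }
        return executed;
    }

    /** Proposed behaviour: the loop starts at the peer's current superstep. */
    static int runRemaining(SuperstepSource peer, int nSupersteps) {
        int executed = 0;
        for (int i = (int) peer.getSuperstepCount(); i < nSupersteps; i++) {
            executed++; // the real benchmark sends messages and syncs here
        }
        return executed;
    }

    public static void main(String[] args) {
        int nSupersteps = 100;
        long recoveredAt = 56; // assumed value reported after recovery

        int restartTotal = (int) recoveredAt + runFromZero(nSupersteps);
        int resumeTotal = (int) recoveredAt + runRemaining(() -> recoveredAt, nSupersteps);

        System.out.println("restart from 0: " + restartTotal + " supersteps");
        System.out.println("resume from superstep count: " + resumeTotal + " supersteps");
    }
}
```

Under these assumptions, restarting from 0 gives 56 + 100 = 156 supersteps, which matches the counter in the log above, while resuming from the superstep count gives 56 + 44 = 100.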
> GraphJobRunner has a similar problem. When a task is relaunched, the setup() method is called again. The code below should run only during the initial phase.
> {code}
>     long startTime = System.currentTimeMillis();
>     loadVertices(peer);
>     LOG.info("Total time spent for loading vertices: "
>         + (System.currentTimeMillis() - startTime) + " ms");
>     startTime = System.currentTimeMillis();
>     countGlobalVertexCount(peer);
>     LOG.info("Total time spent for broadcasting global vertex count: "
>         + (System.currentTimeMillis() - startTime) + " ms");
>     startTime = System.currentTimeMillis();
>     doInitialSuperstep(peer);
>     LOG.info("Total time spent for initial superstep: "
>         + (System.currentTimeMillis() - startTime) + " ms");
> {code}
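One way to restrict these calls to the initial phase might be to guard them on the superstep count, as in the standalone sketch below. This is not the attached patch: the guard condition (superstep count equal to 0), the SuperstepSource interface, and the class name are assumptions made for illustration, and the three step names are recorded as strings rather than invoked on a real GraphJobRunner.

```java
import java.util.ArrayList;
import java.util.List;

public class SetupGuardSketch {

    /** Hypothetical stand-in for the one BSPPeer method used by the guard. */
    interface SuperstepSource {
        long getSuperstepCount();
    }

    /**
     * Returns the names of the initial-phase steps that setup() would run.
     * Assumption: a freshly started task reports superstep 0, while a
     * relaunched task reports the superstep it is resuming from.
     */
    static List<String> setup(SuperstepSource peer) {
        List<String> steps = new ArrayList<>();
        if (peer.getSuperstepCount() == 0) {
            steps.add("loadVertices");
            steps.add("countGlobalVertexCount");
            steps.add("doInitialSuperstep");
        }
        return steps;
    }

    public static void main(String[] args) {
        System.out.println("fresh task:      " + setup(() -> 0L));
        System.out.println("relaunched task: " + setup(() -> 57L));
    }
}
```

A relaunched task would still need its vertex state restored by some other means, such as the checkpoint; that part is outside the scope of this sketch.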



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
