hama-user mailing list archives

From Chia-Hung Lin <cli...@googlemail.com>
Subject Re: Zookeeper - Problems?
Date Sun, 16 Jun 2013 14:44:31 GMT
Have you checked whether the underlying network traffic is heavy when the error happens?

I can't be sure, but the symptom suggests that heavy network traffic is causing the ZooKeeper (zk) connection to be lost.
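If transient network congestion is the cause, one general mitigation is to retry the failed ZooKeeper call with a backoff instead of failing the superstep immediately. The Java sketch below is hypothetical and stdlib-only: `RetryWithBackoff`, `callWithRetry` and `TransientException` are illustrative names, not Hama or ZooKeeper APIs, and `TransientException` merely stands in for ZooKeeper's `KeeperException.ConnectionLossException`.

```java
import java.util.concurrent.Callable;

/**
 * Hypothetical helper (not part of Hama or ZooKeeper): retries an operation
 * that fails with a transient error, sleeping with exponential backoff.
 */
public class RetryWithBackoff {

    /** Stand-in for a recoverable failure such as a ZooKeeper connection loss. */
    public static class TransientException extends Exception {
        public TransientException(String msg) { super(msg); }
    }

    /**
     * Calls op up to maxAttempts times. Only TransientException triggers a
     * retry; any other exception propagates immediately.
     */
    public static <T> T callWithRetry(Callable<T> op, int maxAttempts, long initialDelayMs)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delay = initialDelayMs;
        for (int attempt = 1; ; attempt++) {
            try {
                return op.call();
            } catch (TransientException e) {
                if (attempt >= maxAttempts) {
                    throw e; // give up and surface the last failure
                }
                Thread.sleep(delay);
                delay *= 2; // exponential backoff between attempts
            }
        }
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated znode creation (hypothetical path) that loses the
        // connection twice and then succeeds on the third attempt.
        String path = callWithRetry(() -> {
            calls[0]++;
            if (calls[0] < 3) {
                throw new TransientException("connection loss");
            }
            return "/example/sync/node";
        }, 5, 10);
        System.out.println(path + " after " + calls[0] + " attempts");
    }
}
```

With a real ZooKeeper client, a retried `create()` may fail with `NodeExistsException` if the first attempt reached the server before the connection dropped, so retry logic for barrier znodes would likely need to treat that case as success. Checking the network load, as suggested above, is still the first step.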



On 16 June 2013 20:22, Sascha Jonas <sascha.jonas@student.htw-berlin.de> wrote:
> Hey,
>
> I am using Apache Hama on a small cluster of two computers. It works
> fine with a small number of supersteps, but every time I try a large
> number of iterations (e.g. 10000) it crashes.
>
> This time it stopped working after 4600 supersteps. 8 of 16 tasks are
> still running, while the log shows some errors.
>
> I am using Apache Hama 0.6 and the built-in ZooKeeper. Should I move to a
> newer Hama or ZooKeeper version?
>
> 13/06/16 00:14:14 ERROR sync.ZKSyncClient: Error creating zk path /bsp/job_201306091733_0009/sync/4276
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /bsp/job_201306091733_0009/sync/4276
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>         at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:783)
>         at org.apache.hama.bsp.sync.ZKSyncClient.createZnode(ZKSyncClient.java:138)
>         at org.apache.hama.bsp.sync.ZKSyncClient.writeNode(ZKSyncClient.java:290)
>         at org.apache.hama.bsp.sync.ZooKeeperSyncClientImpl.enterBarrier(ZooKeeperSyncClientImpl.java:99)
>         at org.apache.hama.bsp.BSPPeerImpl.enterBarrier(BSPPeerImpl.java:474)
>         at org.apache.hama.bsp.BSPPeerImpl.sync(BSPPeerImpl.java:428)
>         at de.distMLP.Base_MLP_Trainer.calculateAndWriteCost(Base_MLP_Trainer.java:90)
>         at de.distMLP.Train_MultilayerPerceptron$MultilayerPerceptron_Trainer.bsp(Train_MultilayerPerceptron.java:57)
>         at org.apache.hama.bsp.BSPTask.runBSP(BSPTask.java:168)
>         at org.apache.hama.bsp.BSPTask.run(BSPTask.java:144)
>         at org.apache.hama.bsp.GroomServer$BSPPeerChild.main(GroomServer.java:1262)
> 13/06/16 00:14:15 ERROR distMLP.Train_MultilayerPerceptron$MultilayerPerceptron_Trainer: org.apache.hama.bsp.sync.SyncException
> org.apache.hama.bsp.sync.SyncException
>         at org.apache.hama.bsp.sync.ZooKeeperSyncClientImpl.enterBarrier(ZooKeeperSyncClientImpl.java:137)
>         at org.apache.hama.bsp.BSPPeerImpl.enterBarrier(BSPPeerImpl.java:474)
>         at org.apache.hama.bsp.BSPPeerImpl.sync(BSPPeerImpl.java:428)
>         at de.distMLP.Base_MLP_Trainer.calculateAndWriteCost(Base_MLP_Trainer.java:90)
>         at de.distMLP.Train_MultilayerPerceptron$MultilayerPerceptron_Trainer.bsp(Train_MultilayerPerceptron.java:57)
>         at org.apache.hama.bsp.BSPTask.runBSP(BSPTask.java:168)
>         at org.apache.hama.bsp.BSPTask.run(BSPTask.java:144)
>         at org.apache.hama.bsp.GroomServer$BSPPeerChild.main(GroomServer.java:1262)
>
