giraph-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Sertu─č Kaya <sertug.k...@agmlab.com>
Subject Some of the workers hanging while others finish
Date Fri, 08 Nov 2013 16:24:37 GMT
Hi all;

I have a 10 machine cluster; 8 Tasktrackers, 1 namenode and 1 
Jobtracker. All of them have 8GB memory.
When I run a giraph job with 16 workers on a hbase table with 15 
millions of rows (12GB of data) my job does not finish due to 2 
(sometimes 3) workers stuck at superstep -1. Below you can see the logs 
of these workers. Any helps/ideas?
Thanks
Sertug

> *_syslog logs_*
> 2013-11-08 19:00:06,576 INFO org.apache.hadoop.util.NativeCodeLoader: Loaded the native-hadoop
library
> 2013-11-08 19:00:07,449 INFO org.apache.zookeeper.ZooKeeper: Client environment:zookeeper.version=3.3.3-1073969,
built on 02/23/2011 22:27 GMT
> 2013-11-08 19:00:07,449 INFO org.apache.zookeeper.ZooKeeper: Client environment:host.name=host-10-6-149-104
> 2013-11-08 19:00:07,449 INFO org.apache.zookeeper.ZooKeeper: Client environment:java.version=1.7.0_21
> 2013-11-08 19:00:07,449 INFO org.apache.zookeeper.ZooKeeper: Client environment:java.vendor=Oracle
Corporation
> 2013-11-08 19:00:07,449 INFO org.apache.zookeeper.ZooKeeper: Client environment:java.home=/usr/java/jdk1.7.0_21/jre
> 2013-11-08 19:00:07,449 INFO org.apache.zookeeper.ZooKeeper: Client environment:java.class.path=/hadoop/mapred/taskTracker/root/jobcache/job_201311081157_0003/jars/classes:/hadoop/mapred/taskTracker/root/jobcache/job_201311081157_0003/jars:/hadoop/mapred/taskTracker/root/jobcache/job_201311081157_0003/attempt_201311081157_0003_m_000009_0/work:/usr/local/hadoop-1.2.1/libexec/../conf:/usr/java/default/lib/tools.jar:/usr/local/hadoop-1.2.1/libexec/..:/usr/local/hadoop-1.2.1/libexec/../hadoop-core-1.2.1.jar:/usr/local/hadoop-1.2.1/libexec/../lib/asm-3.2.jar:/usr/local/hadoop-1.2.1/libexec/../lib/aspectjrt-1.6.11.jar:/usr/local/hadoop-1.2.1/libexec/../lib/aspectjtools-1.6.11.jar:/usr/local/hadoop-1.2.1/libexec/../lib/commons-beanutils-1.7.0.jar:/usr/local/hadoop-1.2.1/libexec/../lib/commons-beanutils-core-1.8.0.jar:/usr/local/hadoop-1.2.1/libexec/../lib/commons-cli-1.2.jar:/usr/local/hadoop-1.2.1/libexec/../lib/commons-codec-1.4.jar:/usr/local/hadoop-1.2.1/libexec/../lib/commons-collections-3.2.1.jar:/usr/local/hadoop-1.2.1/libexec/../lib/commons-configuration-1.6.jar:/usr/local/hadoop-1.2.1/libexec/../lib/commons-daemon-1.0.1.jar:/usr/local/hadoop-1.2.1/libexec/../lib/commons-digester-1.8.jar:/usr/local/hadoop-1.2.1/libexec/../lib/commons-el-1.0.jar:/usr/local/hadoop-1.2.1/libexec/../lib/commons-httpclient-3.0.1.jar:/usr/local/hadoop-1.2.1/libexec/../lib/commons-io-2.1.jar:/usr/local/hadoop-1.2.1/libexec/../lib/commons-lang-2.4.jar:/usr/local/hadoop-1.2.1/libexec/../lib/commons-logging-1.1.1.jar:/usr/local/hadoop-1.2.1/libexec/../lib/commons-logging-api-1.0.4.jar:/usr/local/hadoop-1.2.1/libexec/../lib/commons-math-2.1.jar:/usr/local/hadoop-1.2.1/libexec/../lib/commons-net-3.1.jar:/usr/local/hadoop-1.2.1/libexec/../lib/core-3.1.1.jar:/usr/local/hadoop-1.2.1/libexec/../lib/giraph-nutch-1.1.0-SNAPSHOT-for-hadoop-0.20.203.0-jar-with-dependencies.jar:/usr/local/hadoop-1.2.1/libexec/../lib/hadoop-capacity-scheduler-1.2.1.jar:/usr/local/hadoop-1.2.1/libexec/../lib/hadoop-fairscheduler-1.2.1.jar:/usr/local/hadoop-1.2.1/libexec/../lib/hadoop-thriftfs-1.2.1.jar:/usr/local/hadoop-1.2.1/libexec/../lib/hsqldb-1.8.0.10.jar:/usr/local/hadoop-1.2.1/libexec/../lib/jackson-core-asl-1.8.8.jar:/usr/local/hadoop-1.2.1/libexec/../lib/jackson-mapper-asl-1.8.8.jar:/usr/local/hadoop-1.2.1/libexec/../lib/jasper-compiler-5.5.12.jar:/usr/local/hadoop-1.2.1/libexec/../lib/jasper-runtime-5.5.12.jar:/usr/local/hadoop-1.2.1/libexec/../lib/jdeb-0.8.jar:/usr/local/hadoop-1.2.1/libexec/../lib/jersey-core-1.8.jar:/usr/local/hadoop-1.2.1/libexec/../lib/jersey-json-1.8.jar:/usr/local/hadoop-1.2.1/libexec/../lib/jersey-server-1.8.jar:/usr/local/hadoop-1.2.1/libexec/../lib/jets3t-0.6.1.jar:/usr/local/hadoop-1.2.1/libexec/../lib/jetty-6.1.26.jar:/usr/local/hadoop-1.2.1/libexec/../lib/jetty-util-6.1.26.jar:/usr/local/hadoop-1.2.1/libexec/../lib/jsch-0.1.42.jar:/usr/local/hadoop-1.2.1/libexec/../lib/junit-4.5.jar:/usr/local/hadoop-1.2.1/libexec/../lib/kfs-0.2.2.jar:/usr/local/hadoop-1.2.1/libexec/../lib/log4j-1.2.15.jar:/usr/local/hadoop-1.2.1/libexec/../lib/mockito-all-1.8.5.jar:/usr/local/hadoop-1.2.1/libexec/../lib/oro-2.0.8.jar:/usr/local/hadoop-1.2.1/libexec/../lib/servlet-api-2.5-20081211.jar:/usr/local/hadoop-1.2.1/libexec/../lib/slf4j-api-1.4.3.jar:/usr/local/hadoop-1.2.1/libexec/../lib/slf4j-log4j12-1.4.3.jar:/usr/local/hadoop-1.2.1/libexec/../lib/xmlenc-0.52.jar:/usr/local/hadoop-1.2.1/libexec/../lib/jsp-2.1/jsp-2.1.jar:/usr/local/hadoop-1.2.1/libexec/../lib/jsp-2.1/jsp-api-2.1.jar
> 2013-11-08 19:00:07,449 INFO org.apache.zookeeper.ZooKeeper: Client environment:java.library.path=/usr/local/hadoop-1.2.1/libexec/../lib/native/Linux-amd64-64:/hadoop/mapred/taskTracker/root/jobcache/job_201311081157_0003/attempt_201311081157_0003_m_000009_0/work
> 2013-11-08 19:00:07,449 INFO org.apache.zookeeper.ZooKeeper: Client environment:java.io.tmpdir=/hadoop/mapred/taskTracker/root/jobcache/job_201311081157_0003/attempt_201311081157_0003_m_000009_0/work/tmp
> 2013-11-08 19:00:07,449 INFO org.apache.zookeeper.ZooKeeper: Client environment:java.compiler=<NA>
> 2013-11-08 19:00:07,449 INFO org.apache.zookeeper.ZooKeeper: Client environment:os.name=Linux
> 2013-11-08 19:00:07,449 INFO org.apache.zookeeper.ZooKeeper: Client environment:os.arch=amd64
> 2013-11-08 19:00:07,449 INFO org.apache.zookeeper.ZooKeeper: Client environment:os.version=2.6.32-358.18.1.el6.x86_64
> 2013-11-08 19:00:07,449 INFO org.apache.zookeeper.ZooKeeper: Client environment:user.name=root
> 2013-11-08 19:00:07,449 INFO org.apache.zookeeper.ZooKeeper: Client environment:user.home=/root
> 2013-11-08 19:00:07,450 INFO org.apache.zookeeper.ZooKeeper: Client environment:user.dir=/hadoop/mapred/taskTracker/root/jobcache/job_201311081157_0003/attempt_201311081157_0003_m_000009_0/work
> 2013-11-08 19:00:07,450 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection,
connectString=host-10-6-149-98:2181 sessionTimeout=180000 watcher=hconnection
> 2013-11-08 19:00:07,465 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection
to server host-10-6-149-98/10.6.149.98:2181
> 2013-11-08 19:00:07,469 INFO org.apache.zookeeper.ClientCnxn: Socket connection established
to host-10-6-149-98/10.6.149.98:2181, initiating session
> 2013-11-08 19:00:07,475 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete
on server host-10-6-149-98/10.6.149.98:2181, sessionid = 0x142380291760046, negotiated timeout
= 40000
> 2013-11-08 19:00:07,663 INFO org.apache.hadoop.hbase.mapreduce.TableOutputFormat: Created
table instance for host
> 2013-11-08 19:00:07,674 INFO org.apache.hadoop.util.ProcessTree: setsid exited with exit
code 0
> 2013-11-08 19:00:07,683 INFO org.apache.hadoop.mapred.Task:  Using ResourceCalculatorPlugin
: org.apache.hadoop.util.LinuxResourceCalculatorPlugin@271b72b5
> 2013-11-08 19:00:07,762 INFO org.apache.hadoop.mapred.MapTask: Processing split: 'org.apache.giraph.bsp.BspInputSplit,
index=-1, num=-1
> 2013-11-08 19:00:07,806 INFO org.apache.giraph.graph.GraphTaskManager: setup: Log level
remains at info
> 2013-11-08 19:00:07,891 INFO org.apache.giraph.graph.GraphTaskManager: Distributed cache
is empty. Assuming fatjar.
> 2013-11-08 19:00:07,892 INFO org.apache.giraph.graph.GraphTaskManager: setup: classpath
@ /hadoop/mapred/taskTracker/root/jobcache/job_201311081157_0003/jars/job.jar for job org.apache.nutch.scoring.HostRankGiraphJob
> 2013-11-08 19:00:07,892 INFO org.apache.giraph.graph.GraphTaskManager: setup: Starting
up BspServiceWorker...
> 2013-11-08 19:00:07,906 INFO org.apache.giraph.bsp.BspService: BspService: Path to create
to halt is /_hadoopBsp/job_201311081157_0003/_haltComputation
> 2013-11-08 19:00:07,906 INFO org.apache.giraph.bsp.BspService: BspService: Connecting
to ZooKeeper with job job_201311081157_0003, 9 on host-10-6-149-98:2181
> 2013-11-08 19:00:07,907 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection,
connectString=host-10-6-149-98:2181 sessionTimeout=60000 watcher=org.apache.giraph.worker.BspServiceWorker@7d766a1d
> 2013-11-08 19:00:07,918 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection
to server host-10-6-149-98/10.6.149.98:2181
> 2013-11-08 19:00:07,919 INFO org.apache.zookeeper.ClientCnxn: Socket connection established
to host-10-6-149-98/10.6.149.98:2181, initiating session
> 2013-11-08 19:00:07,922 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete
on server host-10-6-149-98/10.6.149.98:2181, sessionid = 0x142380291760055, negotiated timeout
= 40000
> 2013-11-08 19:00:07,922 INFO org.apache.giraph.bsp.BspService: process: Asynchronous
connection complete.
> 2013-11-08 19:00:08,004 INFO org.apache.giraph.comm.netty.NettyServer: NettyServer: Using
execution handler with 8 threads after requestFrameDecoder.
> 2013-11-08 19:00:08,042 INFO org.apache.giraph.comm.netty.NettyServer: start: Started
server communication server: host-10-6-149-104/10.6.149.104:30009 with up to 16 threads on
bind attempt 0 with sendBufferSize = 32768 receiveBufferSize = 524288 backlog = 15
> 2013-11-08 19:00:08,047 INFO org.apache.giraph.comm.netty.NettyClient: NettyClient: Using
execution handler with 8 threads after requestEncoder.
> 2013-11-08 19:00:08,058 INFO org.apache.giraph.graph.GraphTaskManager: setup: Registering
health of this worker...
> 2013-11-08 19:00:08,078 INFO org.apache.giraph.bsp.BspService: getJobState: Job state
already exists (/_hadoopBsp/job_201311081157_0003/_masterJobState)
> 2013-11-08 19:00:08,083 INFO org.apache.giraph.bsp.BspService: getApplicationAttempt:
Node /_hadoopBsp/job_201311081157_0003/_applicationAttemptsDir already exists!
> 2013-11-08 19:00:08,087 INFO org.apache.giraph.bsp.BspService: getApplicationAttempt:
Node /_hadoopBsp/job_201311081157_0003/_applicationAttemptsDir already exists!
> 2013-11-08 19:00:08,092 INFO org.apache.giraph.worker.BspServiceWorker: registerHealth:
Created my health node for attempt=0, superstep=-1 with /_hadoopBsp/job_201311081157_0003/_applicationAttemptsDir/0/_superstepDir/-1/_workerHealthyDir/host-10-6-149-104_9
and workerInfo= Worker(hostname=host-10-6-149-104, MRtaskID=9, port=30009)
> 2013-11-08 19:00:08,100 INFO org.apache.giraph.worker.BspServiceWorker: startSuperstep:
Master(hostname=host-10-6-149-103, MRtaskID=0, port=30000)
> 2013-11-08 19:00:08,100 INFO org.apache.giraph.worker.BspServiceWorker: startSuperstep:
Ready for computation on superstep -1 since worker selection and vertex range assignments
are done in /_hadoopBsp/job_201311081157_0003/_applicationAttemptsDir/0/_superstepDir/-1/_addressesAndPartitions
> 2013-11-08 19:00:08,101 INFO org.apache.giraph.comm.netty.NettyClient: Using Netty without
authentication.
> 2013-11-08 19:00:08,115 INFO org.apache.giraph.comm.netty.NettyClient: Using Netty without
authentication.
> 2013-11-08 19:00:08,120 INFO org.apache.giraph.comm.netty.NettyClient: Using Netty without
authentication.
> 2013-11-08 19:00:08,121 INFO org.apache.giraph.comm.netty.NettyClient: Using Netty without
authentication.
> 2013-11-08 19:00:08,121 INFO org.apache.giraph.comm.netty.NettyClient: Using Netty without
authentication.
> 2013-11-08 19:00:08,127 INFO org.apache.giraph.comm.netty.NettyClient: Using Netty without
authentication.
> 2013-11-08 19:00:08,129 INFO org.apache.giraph.comm.netty.NettyClient: Using Netty without
authentication.
> 2013-11-08 19:00:08,131 INFO org.apache.giraph.comm.netty.NettyClient: Using Netty without
authentication.
> 2013-11-08 19:00:08,131 INFO org.apache.giraph.comm.netty.NettyClient: Using Netty without
authentication.
> 2013-11-08 19:00:08,131 INFO org.apache.giraph.comm.netty.NettyClient: Using Netty without
authentication.
> 2013-11-08 19:00:08,132 INFO org.apache.giraph.comm.netty.NettyClient: Using Netty without
authentication.
> 2013-11-08 19:00:08,132 INFO org.apache.giraph.comm.netty.NettyClient: Using Netty without
authentication.
> 2013-11-08 19:00:08,132 INFO org.apache.giraph.comm.netty.NettyClient: Using Netty without
authentication.
> 2013-11-08 19:00:08,132 INFO org.apache.giraph.comm.netty.NettyClient: Using Netty without
authentication.
> 2013-11-08 19:00:08,139 INFO org.apache.giraph.comm.netty.NettyClient: connectAllAddresses:
Successfully added 14 connections, (14 total connected) 0 failed, 0 failures total.
> 2013-11-08 19:00:18,154 INFO org.apache.giraph.utils.TaskIdsPermitsBarrier: waitForRequiredPermits:
Waiting for 1 more tasks to send their aggregator data, task ids: [-1]
> 2013-11-08 19:00:38,155 INFO org.apache.giraph.utils.TaskIdsPermitsBarrier: waitForRequiredPermits:
Waiting for 1 more tasks to send their aggregator data, task ids: [-1]
> 2013-11-08 19:00:58,155 INFO org.apache.giraph.utils.TaskIdsPermitsBarrier: waitForRequiredPermits:
Waiting for 1 more tasks to send their aggregator data, task ids: [-1]
> 2013-11-08 19:01:08,156 INFO org.apache.giraph.utils.TaskIdsPermitsBarrier: waitForRequiredPermits:
Waiting for 1 more tasks to send their aggregator data, task ids: [-1]
> 2013-11-08 19:01:28,156 INFO org.apache.giraph.utils.TaskIdsPermitsBarrier: waitForRequiredPermits:
Waiting for 1 more tasks to send their aggregator data, task ids: [-1]
> 2013-11-08 19:01:38,157 INFO org.apache.giraph.utils.TaskIdsPermitsBarrier: waitForRequiredPermits:
Waiting for 1 more tasks to send their aggregator data, task ids: [-1]
> 2013-11-08 19:01:58,157 INFO org.apache.giraph.utils.TaskIdsPermitsBarrier: waitForRequiredPermits:
Waiting for 1 more tasks to send their aggregator data, task ids: [-1]
> 2013-11-08 19:02:08,158 INFO org.apache.giraph.utils.TaskIdsPermitsBarrier: waitForRequiredPermits:
Waiting for 1 more tasks to send their aggregator data, task ids: [-1]
> 2013-11-08 19:02:28,158 INFO org.apache.giraph.utils.TaskIdsPermitsBarrier: waitForRequiredPermits:
Waiting for 1 more tasks to send their aggregator data, task ids: [-1]
> 2013-11-08 19:02:32,555 WARN org.apache.giraph.bsp.BspService: process: Unknown and unprocessed
event (path=/_hadoopBsp/job_201311081157_0003/_applicationAttemptsDir/0/_superstepDir, type=NodeChildrenChanged,
state=SyncConnected)
> 2013-11-08 19:02:38,159 INFO org.apache.giraph.utils.TaskIdsPermitsBarrier: waitForRequiredPermits:
Waiting for 1 more tasks to send their aggregator data, task ids: [-1]
> 2013-11-08 19:02:58,159 INFO org.apache.giraph.utils.TaskIdsPermitsBarrier: waitForRequiredPermits:
Waiting for 1 more tasks to send their aggregator data, task ids: [-1]
> 2013-11-08 19:03:08,160 INFO org.apache.giraph.utils.TaskIdsPermitsBarrier: waitForRequiredPermits:
Waiting for 1 more tasks to send their aggregator data, task ids: [-1]
> 2013-11-08 19:03:28,160 INFO org.apache.giraph.utils.TaskIdsPermitsBarrier: waitForRequiredPermits:
Waiting for 1 more tasks to send their aggregator data, task ids: [-1]
> 2013-11-08 19:03:38,161 INFO org.apache.giraph.utils.TaskIdsPermitsBarrier: waitForRequiredPermits:
Waiting for 1 more tasks to send their aggregator data, task ids: [-1]
> 2013-11-08 19:03:58,161 INFO org.apache.giraph.utils.TaskIdsPermitsBarrier: waitForRequiredPermits:
Waiting for 1 more tasks to send their aggregator data, task ids: [-1]
> 2013-11-08 19:04:08,162 INFO org.apache.giraph.utils.TaskIdsPermitsBarrier: waitForRequiredPermits:
Waiting for 1 more tasks to send their aggregator data, task ids: [-1]
> 2013-11-08 19:04:28,162 INFO org.apache.giraph.utils.TaskIdsPermitsBarrier: waitForRequiredPermits:
Waiting for 1 more tasks to send their aggregator data, task ids: [-1]
> 2013-11-08 19:04:38,163 INFO org.apache.giraph.utils.TaskIdsPermitsBarrier: waitForRequiredPermits:
Waiting for 1 more tasks to send their aggregator data, task ids: [-1]
> 2013-11-08 19:04:58,163 INFO org.apache.giraph.utils.TaskIdsPermitsBarrier: waitForRequiredPermits:
Waiting for 1 more tasks to send their aggregator data, task ids: [-1]
> 2013-11-08 19:05:08,164 INFO org.apache.giraph.utils.TaskIdsPermitsBarrier: waitForRequiredPermits:
Waiting for 1 more tasks to send their aggregator data, task ids: [-1]
> 2013-11-08 19:05:28,164 INFO org.apache.giraph.utils.TaskIdsPermitsBarrier: waitForRequiredPermits:
Waiting for 1 more tasks to send their aggregator data, task ids: [-1]
> 2013-11-08 19:05:38,165 INFO org.apache.giraph.utils.TaskIdsPermitsBarrier: waitForRequiredPermits:
Waiting for 1 more tasks to send their aggregator data, task ids: [-1]
> 2013-11-08 19:05:58,165 INFO org.apache.giraph.utils.TaskIdsPermitsBarrier: waitForRequiredPermits:
Waiting for 1 more tasks to send their aggregator data, task ids: [-1]
> 2013-11-08 19:06:08,166 INFO org.apache.giraph.utils.TaskIdsPermitsBarrier: waitForRequiredPermits:
Waiting for 1 more tasks to send their aggregator data, task ids: [-1]
> 2013-11-08 19:06:28,166 INFO org.apache.giraph.utils.TaskIdsPermitsBarrier: waitForRequiredPermits:
Waiting for 1 more tasks to send their aggregator data, task ids: [-1]
> 2013-11-08 19:06:38,167 INFO org.apache.giraph.utils.TaskIdsPermitsBarrier: waitForRequiredPermits:
Waiting for 1 more tasks to send their aggregator data, task ids: [-1]
> 2013-11-08 19:06:58,167 INFO org.apache.giraph.utils.TaskIdsPermitsBarrier: waitForRequiredPermits:
Waiting for 1 more tasks to send their aggregator data, task ids: [-1]
> 2013-11-08 19:07:08,168 INFO org.apache.giraph.utils.TaskIdsPermitsBarrier: waitForRequiredPermits:
Waiting for 1 more tasks to send their aggregator data, task ids: [-1]
> 2013-11-08 19:07:26,593 INFO org.apache.giraph.worker.BspServiceWorker: processEvent:
Job state changed, checking to see if it needs to restart
> 2013-11-08 19:07:26,605 INFO org.apache.giraph.bsp.BspService: getJobState: Job state
already exists (/_hadoopBsp/job_201311081157_0003/_masterJobState)
> 2013-11-08 19:07:28,168 INFO org.apache.giraph.utils.TaskIdsPermitsBarrier: waitForRequiredPermits:
Waiting for 1 more tasks to send their aggregator data, task ids: [-1]
> 2013-11-08 19:07:38,169 INFO org.apache.giraph.utils.TaskIdsPermitsBarrier: waitForRequiredPermits:
Waiting for 1 more tasks to send their aggregator data, task ids: [-1]
> 2013-11-08 19:07:58,169 INFO org.apache.giraph.utils.TaskIdsPermitsBarrier: waitForRequiredPermits:
Waiting for 1 more tasks to send their aggregator data, task ids: [-1]
> 2013-11-08 19:08:08,170 INFO org.apache.giraph.utils.TaskIdsPermitsBarrier: waitForRequiredPermits:
Waiting for 1 more tasks to send their aggregator data, task ids: [-1]
> 2013-11-08 19:08:28,170 INFO org.apache.giraph.utils.TaskIdsPermitsBarrier: waitForRequiredPermits:
Waiting for 1 more tasks to send their aggregator data, task ids: [-1]


Mime
View raw message