From: "Thibaut (JIRA)"
To: hbase-dev@hadoop.apache.org
Date: Mon, 22 Dec 2008 03:48:44 -0800 (PST)
Subject: [jira] Commented: (HBASE-1078) "java.io.IOException: Could not obtain block": although file is there and accessible through the dfs client

    [ https://issues.apache.org/jira/browse/HBASE-1078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12658516#action_12658516 ]

Thibaut commented on HBASE-1078:
--------------------------------

At least one of the DFS datanodes has lots of DataXceiver exceptions. (I had removed the Hadoop config entry for this; I will set dfs.datanode.max.xcievers to 2048 and retry.)
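For reference, raising that limit amounts to adding the property below to the datanodes' Hadoop site configuration (hadoop-site.xml on this Hadoop version, if I'm not mistaken) and restarting the datanodes. This is only a sketch of the change described above, not a tuned value:

<property>
  <!-- value taken from the comment above; adjust as needed -->
  <name>dfs.datanode.max.xcievers</name>
  <value>2048</value>
</property>

The datanode log currently shows: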
08/12/22 12:33:07 INFO datanode.DataNode: writeBlock blk_2596678729303248869_20816 received exception java.io.EOFException: while trying to read 65557 bytes
08/12/22 12:33:07 ERROR datanode.DataNode: DatanodeRegistration(192.168.0.7:50010, storageID=DS-1147686120-78.46.82.211-50010-1229624756268, infoPort=50075, ipcPort=50020):DataXceiver
java.io.EOFException: while trying to read 65557 bytes
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:254)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:298)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:362)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:514)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:356)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:102)
        at java.lang.Thread.run(Thread.java:619)
08/12/22 12:33:07 INFO datanode.DataNode: PacketResponder blk_-3816193612223297244_20739 1 Exception java.net.SocketException: Broken pipe
        at java.net.SocketOutputStream.socketWrite0(Native Method)
        at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92)
        at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
        at java.io.DataOutputStream.writeLong(DataOutputStream.java:207)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:836)
        at java.lang.Thread.run(Thread.java:619)
08/12/22 12:33:07 INFO datanode.DataNode: PacketResponder blk_-3816193612223297244_20739 1 : Thread is interrupted.
08/12/22 12:33:07 INFO datanode.DataNode: PacketResponder 1 for block blk_-3816193612223297244_20739 terminating
08/12/22 12:33:07 INFO datanode.DataNode: writeBlock blk_-3816193612223297244_20739 received exception java.io.EOFException: while trying to read 65557 bytes
08/12/22 12:33:07 ERROR datanode.DataNode: DatanodeRegistration(192.168.0.7:50010, storageID=DS-1147686120-78.46.82.211-50010-1229624756268, infoPort=50075, ipcPort=50020):DataXceiver
java.io.EOFException: while trying to read 65557 bytes
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:254)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:298)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:362)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:514)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:356)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:102)
        at java.lang.Thread.run(Thread.java:619)
08/12/22 12:33:10 INFO datanode.DataNode: PacketResponder blk_5759147435898729286_20763 1 Exception java.net.SocketException: Broken pipe
        at java.net.SocketOutputStream.socketWrite0(Native Method)
        at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92)
        at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
        at java.io.DataOutputStream.writeLong(DataOutputStream.java:207)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:836)
        at java.lang.Thread.run(Thread.java:619)
08/12/22 12:33:10 INFO datanode.DataNode: PacketResponder blk_5759147435898729286_20763 1 : Thread is interrupted.
08/12/22 12:33:10 INFO datanode.DataNode: PacketResponder 1 for block blk_5759147435898729286_20763 terminating
08/12/22 12:33:10 INFO datanode.DataNode: writeBlock blk_5759147435898729286_20763 received exception java.io.EOFException: while trying to read 65557 bytes
08/12/22 12:33:10 ERROR datanode.DataNode: DatanodeRegistration(192.168.0.7:50010, storageID=DS-1147686120-78.46.82.211-50010-1229624756268, infoPort=50075, ipcPort=50020):DataXceiver
java.io.EOFException: while trying to read 65557 bytes
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:254)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:298)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:362)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:514)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:356)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:102)
        at java.lang.Thread.run(Thread.java:619)

This might be what is causing the hbase failure. When I try to shut down the cluster, two region servers hang. Here is the stack trace of one of them (the other one is identical):

2008-12-22 12:38:16
Full thread dump Java HotSpot(TM) 64-Bit Server VM (10.0-b23 mixed mode):

"Attach Listener" daemon prio=10 tid=0x00002aab354cac00 nid=0x7af7 waiting on condition [0x0000000000000000..0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"pool-1-thread-1" prio=10 tid=0x00002aab35a2b400 nid=0x22b0 waiting on condition [0x0000000042a52000..0x0000000042a52aa0]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for <0x00002aaab554efc0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
        at java.util.concurrent.DelayQueue.take(DelayQueue.java:160)
        at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:582)
        at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:575)
        at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:946)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:906)
        at java.lang.Thread.run(Thread.java:619)

"DestroyJavaVM" prio=10 tid=0x00002aab381c8000 nid=0x2275 waiting on condition [0x0000000000000000..0x000000004022ad10]
   java.lang.Thread.State: RUNNABLE

"Low Memory Detector" daemon prio=10 tid=0x00002aab3483c000 nid=0x227e runnable [0x0000000000000000..0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"CompilerThread1" daemon prio=10 tid=0x00002aab34839800 nid=0x227d waiting on condition [0x0000000000000000..0x0000000040a31440]
   java.lang.Thread.State: RUNNABLE

"CompilerThread0" daemon prio=10 tid=0x00002aab34836000 nid=0x227c waiting on condition [0x0000000000000000..0x00000000409303c0]
   java.lang.Thread.State: RUNNABLE

"Signal Dispatcher" daemon prio=10 tid=0x00002aab34834c00 nid=0x227b runnable [0x0000000000000000..0x0000000040830730]
   java.lang.Thread.State: RUNNABLE

"Finalizer" daemon prio=10 tid=0x00002aab34628400 nid=0x227a in Object.wait() [0x000000004072f000..0x000000004072fa20]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:116)
        - locked <0x00002aaab535e3f8> (a java.lang.ref.ReferenceQueue$Lock)
        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:132)
        at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)

"Reference Handler" daemon prio=10 tid=0x00002aab34627000 nid=0x2279 in Object.wait() [0x000000004062e000..0x000000004062eda0]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at java.lang.Object.wait(Object.java:485)
        at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
        - locked <0x00002aaab5290178> (a java.lang.ref.Reference$Lock)

"VM Thread" prio=10 tid=0x00002aab34621c00 nid=0x2278 runnable

"GC task thread#0 (ParallelGC)" prio=10 tid=0x000000004011e000 nid=0x2276 runnable

"GC task thread#1 (ParallelGC)" prio=10 tid=0x000000004011f400 nid=0x2277 runnable

"VM Periodic Task Thread" prio=10 tid=0x00002aab3483dc00 nid=0x227f waiting on condition

JNI global references: 1001

There is nothing suspicious in the log file; it just ends with:

2008-12-22 12:26:16,117 INFO [regionserver/0:0:0:0:0:0:0:0:60020] regionserver.HRegionServer(508): telling master that region server is shutting down at: 192.168.0.7:60020
2008-12-22 12:26:16,122 INFO [regionserver/0:0:0:0:0:0:0:0:60020] regionserver.HRegionServer(515): stopping server at: 192.168.0.7:60020
2008-12-22 12:26:16,545 INFO [regionserver/0:0:0:0:0:0:0:0:60020] regionserver.HRegionServer(525): regionserver/0:0:0:0:0:0:0:0:60020 exiting


> "java.io.IOException: Could not obtain block": although file is there and accessible through the dfs client
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-1078
>                 URL: https://issues.apache.org/jira/browse/HBASE-1078
>             Project: Hadoop HBase
>          Issue Type: Bug
>         Environment: hadoop 0.19.0
>                      hbase 0.19.0-dev, r728134
>            Reporter: Thibaut
>         Attachments: errorlogs.zip
>
>
> Hi,
> after doing some more stress testing, my cluster just stopped working. The regionserver responsible for the ROOT region can't read a block related to the ROOT region, but the block is definitely there, as I can read the file through the dfs client.
> All new clients fail to start:
>
> java.io.IOException: java.io.IOException: Could not obtain block: blk_-3504243288385983835_18732 file=/hbase/-ROOT-/70236052/info/mapfiles/780254459775584115/data
>         at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1708)
>         at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1536)
>         at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1663)
>         at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1593)
>         at java.io.DataInputStream.readInt(DataInputStream.java:370)
>         at org.apache.hadoop.io.SequenceFile$Reader.readRecordLength(SequenceFile.java:1895)
>         at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1925)
>         at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1830)
>         at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1876)
>         at org.apache.hadoop.io.MapFile$Reader.next(MapFile.java:517)
>         at org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromMapFile(HStore.java:1709)
>         at org.apache.hadoop.hbase.regionserver.HStore.getRowKeyAtOrBefore(HStore.java:1681)
>         at org.apache.hadoop.hbase.regionserver.HRegion.getClosestRowBefore(HRegion.java:1072)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.getClosestRowBefore(HRegionServer.java:1466)
>         at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:632)
>         at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:894)
>         at sun.reflect.GeneratedConstructorAccessor13.newInstance(Unknown Source)
>         at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
>         at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
>         at org.apache.hadoop.hbase.RemoteExceptionHandler.decodeRemoteException(RemoteExceptionHandler.java:95)
>         at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegionInMeta(HConnectionManager.java:550)
>         at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:450)
>         at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.relocateRegion(HConnectionManager.java:422)
>         at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegionInMeta(HConnectionManager.java:559)
>         at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:454)
>         at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:415)
>         at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:113)
>         at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:96)
>
> Clients that are still connected continue to work. (Presumably because they have cached the ROOT region location?)
> This seems to have happened after one of the region servers (number 3) shut itself down due to exceptions (EOFException, "Unable to create new block", etc.; see the log file). The ROOT region then probably got moved to region server 2.
> I attached the logs (DEBUG enabled) of the hdfs namenode, the hbase master node and the log files of region servers 2 and 3.
> The filesystem is in a healthy state. I can also download the file through the hadoop fs command without any problem and without getting an error message about missing blocks.
> Status: HEALTHY
>  Total size:                    142881532319 B (Total open files size: 12415139840 B)
>  Total dirs:                    4153
>  Total files:                   3541 (Files currently being written: 106)
>  Total blocks (validated):      5208 (avg. block size 27435010 B) (Total open file blocks (not validated): 205)
>  Minimally replicated blocks:   5208 (100.0 %)
>  Over-replicated blocks:        0 (0.0 %)
>  Under-replicated blocks:       0 (0.0 %)
>  Mis-replicated blocks:         0 (0.0 %)
>  Default replication factor:    4
>  Average block replication:     4.0
>  Corrupt blocks:                0
>  Missing replicas:              0 (0.0 %)
>  Number of data-nodes:          7
>  Number of racks:               1
>
> The filesystem under path '/' is HEALTHY

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.