Message-ID: <4FE828BB.5030701@uha.fr>
Date: Mon, 25 Jun 2012 11:00:43 +0200
From: Frédéric Fondement
Organization: Université de Haute Alsace - ENSISA
To: user@hbase.apache.org
Subject: datanode timeout

Hi all!

I'm having trouble with my HBase cluster: the following error appears more and more often (every 2 to 15 minutes on each node):

2012-06-25 10:25:30,646 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.120.0.5:50010, storageID=DS-1339564791-127.0.0.1-50010-1296151113818, infoPort=50075, ipcPort=50020):Got exception while serving blk_4839251368515801234_555101 to /10.120.0.5:
java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/10.120.0.5:50010 remote=/10.120.0.5:42564]
    at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
    at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
    at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
    at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:397)
    at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:493)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:267)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:163)

2012-06-25 10:25:30,646 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.120.0.5:50010, storageID=DS-1339564791-127.0.0.1-50010-1296151113818, infoPort=50075, ipcPort=50020):DataXceiver
java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/10.120.0.5:50010 remote=/10.120.0.5:42564]
    at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
    at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
    at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
    at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:397)
    at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:493)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:267)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:163)

As you might have guessed, the local machine is 10.120.0.5. Unsurprisingly, the process listening on port 50010 is the datanode. The remote port (42564 here) changes from one occurrence of the error to the next, and seems to belong to the regionserver process.

If I list the processes connected to port 50010 with 'lsof -i :50010', I get an impressive number of sockets (about 400). Is this normal?

I should add that the current load (requests, I/O, CPU, ...) is rather low. I can't find any other error in the namenode or regionserver logs.

All the best,

Frédéric.
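
P.S. To make the 'lsof -i :50010' observation easier to compare between nodes, the sockets can be grouped by TCP state. The following is only a minimal sketch, not a diagnosis: the function names `count_socket_states` and `datanode_socket_states` are hypothetical helpers, and it assumes that lsof prints the TCP state in parentheses as the last field of each row (as in "(ESTABLISHED)"). The -n and -P flags only disable host and port name resolution.

```python
import subprocess
from collections import Counter


def count_socket_states(lsof_output: str) -> Counter:
    """Count sockets per TCP state in the text printed by `lsof -i`.

    Assumes the state is the last whitespace-separated field of each
    row, wrapped in parentheses, e.g. "(ESTABLISHED)".
    """
    states = Counter()
    for line in lsof_output.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if not fields:
            continue
        last = fields[-1]
        if last.startswith("(") and last.endswith(")"):
            states[last[1:-1]] += 1
        else:
            # Rows without a parenthesised state (e.g. UDP sockets).
            states["UNKNOWN"] += 1
    return states


def datanode_socket_states(port: int = 50010) -> Counter:
    """Run lsof for the given port and summarise the socket states."""
    # lsof exits with a non-zero status when nothing matches, so the
    # return code is deliberately not checked here.
    result = subprocess.run(
        ["lsof", "-n", "-P", "-i", f":{port}"],
        capture_output=True,
        text=True,
        check=False,
    )
    return count_socket_states(result.stdout)


if __name__ == "__main__":
    for state, count in datanode_socket_states().most_common():
        print(f"{state:12s} {count}")
```

A large share of sockets in CLOSE_WAIT, for instance, would mean that the remote side has closed the connection while the local process has not yet closed its end; mostly ESTABLISHED sockets would point to connections that are simply kept open.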