From: Ted Yu
Subject: Re: hbase cannot normally start regionserver in the environment of big data.
Date: Fri, 7 Nov 2014 05:28:04 -0800
To: "user@hbase.apache.org"
Cc: user

Please pastebin the log from the region server around the time it became dead.

What HBase / Hadoop versions are you using?

Anything interesting in the master log?

Thanks

On Nov 7, 2014, at 4:57 AM, Jean-Marc Spaggiari wrote:

> Hi,
>
> Have you checked that your Hadoop is running fine? Have you checked that
> the network between your servers is fine too?
>
> JM
>
> 2014-11-07 5:22 GMT-05:00 hankedang@sina.cn:
>
>> I've deployed a "2+4" cluster which had been running normally for a
>> long time. The cluster holds more than 40T of data. When I deliberately
>> shut down the HBase service and try to restart it, the regionservers die.
>>
>> The regionserver log shows that all the regions are opened, but the
>> datanode logs contain WARN and ERROR entries. Below are the logs in detail:
>>
>> 2014-11-07 14:47:21,584 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.230.63.12:50010, dest: /10.230.63.9:39405, bytes: 4696, op: HDFS_READ, cliID: DFSClient_hb_rs_salve1,60020,1415342303886_-2037622978_29, offset: 31996928, srvID: bb0032a3-1170-4a34-b85b-e2cfa0d56cb2, blockid: BP-1731746090-10.230.63.3-1406195669990:blk_1078709392_4968828, duration: 7978822
>> 2014-11-07 14:47:21,596 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: exception:
>> java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/10.230.63.12:50010 remote=/10.230.63.11:41511]
>>     at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
>>     at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:172)
>>     at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:220)
>>     at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:547)
>>     at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:712)
>>     at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:479)
>>     at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:110)
>>     at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:68)
>>     at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:229)
>>     at java.lang.Thread.run(Thread.java:744)
>> 2014-11-07 14:47:21,599 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.230.63.12:50010, dest: /10.230.63.11:41511, bytes: 726528, op: HDFS_READ, cliID: DFSClient_hb_rs_salve3,60020,1415342303807_1094119849_29, offset: 0, srvID: bb0032a3-1170-4a34-b85b-e2cfa0d56cb2, blockid: BP-1731746090-10.230.63.3-1406195669990:blk_1078034913_4294168, duration: 480190668115
>> 2014-11-07 14:47:21,599 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.230.63.12, datanodeUuid=bb0032a3-1170-4a34-b85b-e2cfa0d56cb2, infoPort=50075, ipcPort=50020, storageInfo=lv=-55;cid=cluster12;nsid=395652542;c=0):Got exception while serving BP-1731746090-10.230.63.3-1406195669990:blk_1078034913_4294168 to /10.230.63.11:41511
>> java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/10.230.63.12:50010 remote=/10.230.63.11:41511]
>>     at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
>>     at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:172)
>>     at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:220)
>>     at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:547)
>>     at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:712)
>>     at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:479)
>>     at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:110)
>>     at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:68)
>>     at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:229)
>>     at java.lang.Thread.run(Thread.java:744)
>> 2014-11-07 14:47:21,600 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: salve4:50010:DataXceiver error processing READ_BLOCK operation src: /10.230.63.11:41511 dest: /10.230.63.12:50010
>>
>> I personally think this happens during the load-on-open stage, when the
>> disk IO of the cluster is very high and the pressure is huge.
>>
>> I wonder what causes the read errors while the HFiles are being read,
>> and what leads to the timeout. Are there any solutions that can throttle
>> the load-on-open speed and reduce the pressure on the cluster?
>>
>> I need help!
>>
>> Thanks!
>>
>> hankedang@sina.cn
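
P.S. The 480000 millis in the exceptions above is the stock DataNode write
timeout, and each regionserver opens regions through a fixed-size executor,
so two configuration knobs may be worth experimenting with. This is only a
sketch: the values below are guesses to tune for your cluster, and you
should confirm both property names are honored by your Hadoop / HBase
versions before relying on them.

In hdfs-site.xml, give slow readers more headroom before the DataNode
gives up on the socket:

  <property>
    <name>dfs.datanode.socket.write.timeout</name>
    <!-- illustrative value: double the 480000 ms (8 min) default -->
    <value>960000</value>
  </property>

In hbase-site.xml, reduce how many regions a regionserver opens
concurrently (the default executor uses 3 threads), which should lower
the disk IO spike during startup at the cost of a slower open phase:

  <property>
    <name>hbase.regionserver.executor.openregion.threads</name>
    <!-- illustrative value: serialize region opens on each server -->
    <value>1</value>
  </property>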