Subject: Re: is my hbase cluster overloaded?
From: Azuryy Yu
To: user@hbase.apache.org
Date: Tue, 22 Apr 2014 15:02:43 +0800

Do you still have the same issue?

And about these options:

    -Xmx8000m -server -XX:NewSize=512m -XX:MaxNewSize=512m

The Eden size is too small.
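As a rough sketch only (assuming those flags belong to the region servers; the 2g figure
below is an illustration, not a tested recommendation, so size it against your heap and
workload), a larger young generation could be set in hbase-env.sh along these lines:

    # hbase-env.sh: example values only; tune NewSize to your heap and workload
    export HBASE_REGIONSERVER_OPTS="-Xmx8000m -server \
      -XX:NewSize=2g -XX:MaxNewSize=2g \
      -XX:+UseParNewGC -XX:+UseConcMarkSweepGC \
      -XX:CMSInitiatingOccupancyFraction=70"

With only 512m of young generation under a heavy write load, minor GCs tend to run very
frequently and objects get promoted into the old generation early, which adds pressure on
the old-generation collector and lengthens pauses.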
On Tue, Apr 22, 2014 at 2:55 PM, Li Li wrote:
> <property>
>   <name>dfs.datanode.handler.count</name>
>   <value>100</value>
>   <description>The number of server threads for the datanode.</description>
> </property>
>
> 1. namenode/master 192.168.10.48
> http://pastebin.com/7M0zzAAc
>
> $ free -m  (these are the values right after restarting hadoop and hbase now,
> not the values from when it crashed)
>              total       used       free     shared    buffers     cached
> Mem:         15951       3819      12131          0        509       1990
> -/+ buffers/cache:       1319      14631
> Swap:         8191          0       8191
>
> 2. datanode/region 192.168.10.45
> http://pastebin.com/FiAw1yju
>
> $ free -m
>              total       used       free     shared    buffers     cached
> Mem:         15951       3627      12324          0       1516        641
> -/+ buffers/cache:       1469      14482
> Swap:         8191          8       8183
>
> On Tue, Apr 22, 2014 at 2:29 PM, Azuryy Yu wrote:
> > One big possible issue is a high level of concurrent requests against HDFS
> > or HBase: all datanode handlers become busy, further requests queue up and
> > eventually time out. You can try increasing dfs.datanode.handler.count and
> > dfs.namenode.handler.count in hdfs-site.xml, then restart HDFS.
> >
> > Also, do you set explicit JVM options for the datanodes, the namenode and
> > the region servers? If they all run with the defaults, that can cause this
> > kind of issue as well.
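(Regarding the handler-count suggestion quoted above: as a minimal sketch, the two
properties go into hdfs-site.xml roughly as shown below. The values are placeholders,
not figures from this thread; pick them based on your cluster size and load, and
restart HDFS afterwards.)

    <!-- hdfs-site.xml: the values below are placeholders, not recommendations -->
    <property>
      <name>dfs.datanode.handler.count</name>
      <value>64</value>
    </property>
    <property>
      <name>dfs.namenode.handler.count</name>
      <value>64</value>
    </property>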
> > On Tue, Apr 22, 2014 at 2:20 PM, Li Li wrote:
> >> my cluster setup: all 6 machines are virtual machines. each machine has
> >> 4 CPUs (Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz) and 16GB memory
> >> 192.168.10.48 namenode/jobtracker
> >> 192.168.10.47 secondary namenode
> >> 192.168.10.45 datanode/tasktracker
> >> 192.168.10.46 datanode/tasktracker
> >> 192.168.10.49 datanode/tasktracker
> >> 192.168.10.50 datanode/tasktracker
> >>
> >> hdfs logs around 20:33
> >> 192.168.10.48 namenode log http://pastebin.com/rwgmPEXR
> >> 192.168.10.45 datanode log http://pastebin.com/HBgZ8rtV (I found this
> >> datanode crashed first)
> >> 192.168.10.46 datanode log http://pastebin.com/aQ2emnUi
> >> 192.168.10.49 datanode log http://pastebin.com/aqsWrrL1
> >> 192.168.10.50 datanode log http://pastebin.com/V7C6tjpB
> >>
> >> hbase logs around 20:33
> >> 192.168.10.48 master log http://pastebin.com/2ZfeYA1p
> >> 192.168.10.45 region log http://pastebin.com/idCF2a7Y
> >> 192.168.10.46 region log http://pastebin.com/WEh4dA0f
> >> 192.168.10.49 region log http://pastebin.com/cGtpbTLz
> >> 192.168.10.50 region log http://pastebin.com/bD6h5T6p (very strange: no
> >> log at 20:33, but there are logs at 20:32 and 20:34)
> >>
> >> On Tue, Apr 22, 2014 at 12:25 PM, Ted Yu wrote:
> >> > Can you post more of the data node log, around 20:33?
> >> >
> >> > Cheers
> >> >
> >> > On Mon, Apr 21, 2014 at 8:57 PM, Li Li wrote:
> >> >> hadoop 1.0
> >> >> hbase 0.94.11
> >> >>
> >> >> datanode log from 192.168.10.45. why did it shut itself down?
> >> >>
> >> >> 2014-04-21 20:33:59,309 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
> >> >> writeBlock blk_-7969006819959471805_202154 received exception
> >> >> java.io.InterruptedIOException: Interruped while waiting for IO on channel
> >> >> java.nio.channels.SocketChannel[closed]. 0 millis timeout left.
> >> >> 2014-04-21 20:33:59,310 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode:
> >> >> DatanodeRegistration(192.168.10.45:50010,
> >> >> storageID=DS-1676697306-192.168.10.45-50010-1392029190949,
> >> >> infoPort=50075, ipcPort=50020):DataXceiver
> >> >> java.io.InterruptedIOException: Interruped while waiting for IO on channel
> >> >> java.nio.channels.SocketChannel[closed]. 0 millis timeout left.
> >> >>         at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:349)
> >> >>         at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
> >> >>         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
> >> >>         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
> >> >>         at java.io.BufferedInputStream.read1(BufferedInputStream.java:273)
> >> >>         at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
> >> >>         at java.io.DataInputStream.read(DataInputStream.java:149)
> >> >>         at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:265)
> >> >>         at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:312)
> >> >>         at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:376)
> >> >>         at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:532)
> >> >>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:398)
> >> >>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:107)
> >> >>         at java.lang.Thread.run(Thread.java:722)
> >> >> 2014-04-21 20:33:59,310 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode:
> >> >> DatanodeRegistration(192.168.10.45:50010,
> >> >> storageID=DS-1676697306-192.168.10.45-50010-1392029190949,
> >> >> infoPort=50075, ipcPort=50020):DataXceiver
> >> >> java.io.InterruptedIOException: Interruped while waiting for IO on channel
> >> >> java.nio.channels.SocketChannel[closed]. 466924 millis timeout left.
> >> >>         at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:349)
> >> >>         at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:245)
> >> >>         at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
> >> >>         at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
> >> >>         at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:350)
> >> >>         at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:436)
> >> >>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:197)
> >> >>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:99)
> >> >>         at java.lang.Thread.run(Thread.java:722)
> >> >> 2014-04-21 20:34:00,291 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
> >> >> Waiting for threadgroup to exit, active threads is 0
> >> >> 2014-04-21 20:34:00,404 INFO org.apache.hadoop.hdfs.server.datanode.FSDatasetAsyncDiskService:
> >> >> Shutting down all async disk service threads...
> >> >> 2014-04-21 20:34:00,405 INFO org.apache.hadoop.hdfs.server.datanode.FSDatasetAsyncDiskService:
> >> >> All async disk service threads have been shut down.
> >> >> 2014-04-21 20:34:00,413 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
> >> >> Exiting Datanode
> >> >> 2014-04-21 20:34:00,424 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: SHUTDOWN_MSG:
> >> >> /************************************************************
> >> >> SHUTDOWN_MSG: Shutting down DataNode at app-hbase-1/192.168.10.45
> >> >> ************************************************************/
> >> >>
> >> >> On Tue, Apr 22, 2014 at 11:25 AM, Ted Yu wrote:
> >> >> > bq. one datanode failed
> >> >> >
> >> >> > Was the crash due to an out-of-memory error?
> >> >> > Can you post the tail of the data node log on pastebin?
> >> >> >
> >> >> > Giving us the versions of hadoop and hbase would be helpful.
> >> >> >
> >> >> > On Mon, Apr 21, 2014 at 7:39 PM, Li Li wrote:
> >> >> >> I have a small hbase cluster with 1 namenode, 1 secondary namenode and
> >> >> >> 4 datanodes.
> >> >> >> The hbase master is on the same machine as the namenode, and the 4 hbase
> >> >> >> slaves run on the datanode machines.
> >> >> >> Average requests per second are about 10,000, and the cluster crashed.
> >> >> >> I found the reason is that one datanode failed.
> >> >> >>
> >> >> >> Each datanode has about 4 CPU cores and 10GB of memory.
> >> >> >> Is my cluster overloaded?