From: 陈加俊 (Chen Jiajun) <cjjvictory@gmail.com>
To: user@hbase.apache.org
Date: Wed, 9 Mar 2011 00:21:41 +0800
Subject: Re: region servers dying - flush request - YCSB

Had the table already been disabled when you hit Ctrl-C?

2011/3/8, M.Deniz OKTAR:
> Something new came up!
>
> I tried to truncate the 'usertable', which had ~12M entries.
>
> The shell sat at "disabling table" for a long time. The process was
> still there, but there were no requests, so I broke out of it with
> Ctrl-C.
>
> Then I tried count 'usertable' to see if the data remained; the shell
> gave an error, and one of the region servers logged the lines below.
>
> The master logs were similar (I tried to disable again, and the master
> log is from that trial).
>
>
> Regionserver 2:
>
> 2011-03-08 16:47:24,852 DEBUG
> org.apache.hadoop.hbase.regionserver.HRegionServer:
> NotServingRegionException; Region is not online:
> usertable,,1299593459085.d37bb124feaf8f5d08e51064a36596f8.
> 2011-03-08 16:47:27,765 DEBUG
> org.apache.hadoop.hbase.io.hfile.LruBlockCache: LRU Stats: total=39.63 MB,
> free=4.65 GB, max=4.68 GB, blocks=35, accesses=376070, hits=12035,
> hitRatio=3.20%%, cachingAccesses=12070, cachingHits=12035,
> cachingHitsRatio=99.71%%, evictions=0, evicted=0, evictedPerRun=NaN
> 2011-03-08 16:47:28,863 DEBUG
> org.apache.hadoop.hbase.regionserver.HRegionServer:
> NotServingRegionException; Region is not online:
> usertable,,1299593459085.d37bb124feaf8f5d08e51064a36596f8.
> 2011-03-08 16:47:28,865 ERROR
> org.apache.hadoop.hbase.regionserver.HRegionServer:
> org.apache.hadoop.hbase.UnknownScannerException: Name: -1
>   at org.apache.hadoop.hbase.regionserver.HRegionServer.next(HRegionServer.java:1795)
>   at sun.reflect.GeneratedMethodAccessor22.invoke(Unknown Source)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>   at java.lang.reflect.Method.invoke(Method.java:597)
>   at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
>   at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1039)
>
>
> Masterserver:
> .
> .
> . (same thing)
> 2011-03-08 16:51:34,679 INFO
> org.apache.hadoop.hbase.master.AssignmentManager: Region has been
> PENDING_CLOSE for too long, running forced unassign again on
> region=usertable,user1948102037,1299592536693.d5bae6bbe54aa182e1215ab626e0011e.
>
>
> --
> deniz
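For context on the question above: in the shell, truncate is essentially
disable + drop + recreate, so a Ctrl-C in the middle of the disable can
leave the table stuck half closed, which would fit the PENDING_CLOSE
messages in the master log. Below is a minimal sketch of checking the
table state and redoing the truncate from the 0.90-era Java client; only
the table name 'usertable' comes from this thread, the rest is
illustrative:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.util.Bytes;

    public class RedoTruncate {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        String table = "usertable";

        // If the Ctrl-C landed mid-disable, finish the disable first;
        // disableTable blocks until the regions are offline.
        if (!admin.isTableDisabled(table)) {
          admin.disableTable(table);
        }

        // truncate = disable + drop + recreate with the same schema.
        HTableDescriptor desc = admin.getTableDescriptor(Bytes.toBytes(table));
        admin.deleteTable(table);
        admin.createTable(desc);
      }
    }

Whether this unsticks a region the master is already force-unassigning is
a separate question, but it at least finishes what the interrupted shell
command started.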
>
> On Tue, Mar 8, 2011 at 4:34 PM, M.Deniz OKTAR wrote:
>
>> Hi all,
>>
>> Thanks for the support. I've been trying to replicate the problem
>> since this morning. Before that, I played with the configuration: I
>> used to have only one user, with all the permissions set accordingly;
>> now I've followed the Cloudera manuals and set the permissions for
>> the hdfs and mapred users (and changed hbase-env.sh).
>>
>> I ran two trials. In both, the Yahoo test failed after receiving lots
>> of "0"s, but the region servers didn't die. At some points in the
>> test (including when it failed), the HBase master threw exceptions
>> about not being able to reach one of the servers. I also lost the ssh
>> connection to that server, but after a while it recovered (and so did
>> the HMaster). The last thing in the regionserver logs was that it was
>> going for a flush.
>>
>> I'll run the tests again and provide you with clean log files from
>> all servers (hadoop, hbase, namenode, masternode logs).
>>
>> If you have any suggestions or directions to help me diagnose the
>> problem better, that would be lovely.
>>
>> btw: these servers do not have ECC memory, but I do not see any
>> corruption in the data.
>>
>> Thanks!
>>
>> --
>> deniz
>>
>>
>> On Mon, Mar 7, 2011 at 7:47 PM, Jean-Daniel Cryans wrote:
>>
>>> Along with a bigger portion of the log, it might be good to check
>>> whether there's anything in the .out file that looks like a jvm
>>> error.
>>>
>>> J-D
>>>
>>> On Mon, Mar 7, 2011 at 9:22 AM, M.Deniz OKTAR wrote:
>>> > I ran every kind of benchmark I could find on those machines and
>>> > they seemed to work fine. Did memory/disk tests too.
>>> >
>>> > The master node and the other nodes log some information and
>>> > exceptions about not being able to reach the dead node.
>>> >
>>> > Btw, sometimes the process does not die but loses the connection.
>>> >
>>> > --
>>> > deniz
>>> >
>>> > On Mon, Mar 7, 2011 at 7:19 PM, Stack wrote:
>>> >
>>> >> I'm stumped. I have nothing to go on when there are no death
>>> >> throes or complaints. This hardware for sure is healthy? Other
>>> >> stuff runs w/o issue?
>>> >> St.Ack
>>> >>
>>> >> On Mon, Mar 7, 2011 at 8:48 AM, M.Deniz OKTAR wrote:
>>> >> > I don't know if it's normal, but I see a lot of '0's in the
>>> >> > test results when it tends to fail, such as:
>>> >> >
>>> >> > 1196 sec: 7394901 operations; 0 current ops/sec;
>>> >> >
>>> >> > --
>>> >> > deniz
>>> >> >
>>> >> > On Mon, Mar 7, 2011 at 6:46 PM, M.Deniz OKTAR wrote:
>>> >> >
>>> >> >> Hi,
>>> >> >>
>>> >> >> Thanks for the effort, answers below:
>>> >> >>
>>> >> >> On Mon, Mar 7, 2011 at 6:08 PM, Stack wrote:
>>> >> >>
>>> >> >>> On Mon, Mar 7, 2011 at 5:43 AM, M.Deniz OKTAR
>>> >> >>> <deniz.oktar@gmail.com> wrote:
>>> >> >>> > We have a 5-node cluster, 4 of them being region servers.
>>> >> >>> > I am running a custom workload with YCSB, and while the
>>> >> >>> > data is loading (heavy insert) at least one of the region
>>> >> >>> > servers dies after about 600000 operations.
>>> >> >>>
>>> >> >>> Tell us the character of your 'custom workload' please.
>>> >> >>>
>>> >> >> The workload is below; the part that fails is the loading
>>> >> >> part (-load), which inserts all the records first.
>>> >> >>
>>> >> >> recordcount=10000000
>>> >> >> operationcount=3000000
>>> >> >> workload=com.yahoo.ycsb.workloads.CoreWorkload
>>> >> >>
>>> >> >> readallfields=true
>>> >> >>
>>> >> >> readproportion=0.5
>>> >> >> updateproportion=0.1
>>> >> >> scanproportion=0
>>> >> >> insertproportion=0.35
>>> >> >> readmodifywriteproportion=0.05
>>> >> >>
>>> >> >> requestdistribution=zipfian
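For reference, with those settings the -load phase writes
recordcount=10,000,000 rows into 'usertable', each with several small
fields, which is what makes it such a heavy sequential-insert test; the
zipfian distribution only matters in the run phase. A minimal,
illustrative Java approximation against the 0.90-era client API; the
column family 'family', the field0..field9 layout, and the 100-byte
field size are YCSB defaults assumed here, not something stated in this
thread:

    import java.util.Random;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class LoadLikeYcsb {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "usertable");
        table.setAutoFlush(false);              // buffer puts client-side
        byte[] family = Bytes.toBytes("family");
        Random rnd = new Random();
        byte[] value = new byte[100];

        for (long i = 0; i < 10000000L; i++) {  // recordcount=10000000
          Put put = new Put(Bytes.toBytes("user" + i));
          for (int f = 0; f < 10; f++) {        // ten fields per record
            rnd.nextBytes(value);
            put.add(family, Bytes.toBytes("field" + f), value);
          }
          table.put(put);                       // goes to the write buffer
        }
        table.flushCommits();                   // push remaining buffered puts
        table.close();
      }
    }

Even this simplified loop keeps every region server flushing and
compacting continuously, which is why the flush lines dominate the end
of the logs below.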
>>> >> >>
>>> >> >>> > There are no abnormalities in the logs as far as I can
>>> >> >>> > see; the only common point is that all of them (in
>>> >> >>> > different trials, different region servers fail) request a
>>> >> >>> > flush in their last log lines, given below. The .out files
>>> >> >>> > are empty. I am looking at the /var/log/hbase folder for
>>> >> >>> > logs. Running the latest Sun Java 6. I couldn't find any
>>> >> >>> > logs that indicate a problem with Java. Tried the tests
>>> >> >>> > with OpenJDK and had the same results.
>>> >> >>> >
>>> >> >>> It's strange that flush is the last thing in your log. The
>>> >> >>> process is dead? We are exiting w/o a note in the logs?
>>> >> >>> That's unusual. We usually scream loudly when dying.
>>> >> >>
>>> >> >> Yes, that's the strange part. The last line is a flush, as if
>>> >> >> the process never failed. Yes, the process is dead and HBase
>>> >> >> cannot see the node.
>>> >> >>
>>> >> >>> > I have set ulimits (50000) and xceivers (20000) for
>>> >> >>> > multiple users and am certain that they are correct.
>>> >> >>>
>>> >> >>> The first line in an hbase log prints out the ulimit it
>>> >> >>> sees. You might check that the hbase process for sure is
>>> >> >>> picking up your ulimit setting.
>>> >> >>
>>> >> >> That was a mistake I made a couple of days ago; I checked it
>>> >> >> with cat /proc/<pid>/limits, and all related users like
>>> >> >> 'hbase' have those limits. Checked the logs:
>>> >> >>
>>> >> >> Mon Mar 7 06:41:15 EET 2011 Starting regionserver on test-1
>>> >> >> ulimit -n 52768
>>> >> >>
>>> >> >>> > Also in the kernel logs, there are no apparent problems.
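The startup line is the authoritative check, but the limits of an
already-running regionserver can also be read straight from procfs,
which is what the cat /proc/<pid>/limits check above does. A small
Linux-only Java sketch of the same check (class name illustrative; pass
the regionserver's pid to inspect a live process rather than this JVM):

    import java.io.BufferedReader;
    import java.io.FileReader;

    public class ShowOpenFileLimit {
      public static void main(String[] args) throws Exception {
        // Default to this JVM's own limits; pass a pid for another process.
        String path = args.length > 0 ? "/proc/" + args[0] + "/limits"
                                      : "/proc/self/limits";
        BufferedReader in = new BufferedReader(new FileReader(path));
        String line;
        while ((line = in.readLine()) != null) {
          // e.g. "Max open files    52768    52768    files"
          if (line.startsWith("Max open files")) {
            System.out.println(line);
          }
        }
        in.close();
      }
    }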
>>> >> >>> > >>> >> >>> >>> >> >>> (The mystery compounds) >>> >> >>> >>> >> >>> > 2011-03-07 15:07:58,301 DEBUG >>> >> >>> > org.apache.hadoop.hbase.regionserver.CompactSplitThread: >>> Compaction >>> >> >>> > requested for >>> >> >>> > >>> >> >>> usertable,user1030079237,1299502934627.257739740f58da96d5c5ef51a7d3efc3= . >>> >> >>> > because regionserver60020.cacheFlusher; priority=3D3, compacti= on >>> queue >>> >> >>> size=3D18 >>> >> >>> > 2011-03-07 15:07:58,301 DEBUG >>> >> >>> org.apache.hadoop.hbase.regionserver.HRegion: >>> >> >>> > NOT flushing memstore for region >>> >> >>> > >>> >> >>> >>> >> >>> usertable,user1601881548,1299502135191.f8efb9aa0922fa8a6a53fc49b8155ebc= ., >>> >> >>> > flushing=3Dfalse, writesEnabled=3Dfalse >>> >> >>> > 2011-03-07 15:07:58,301 DEBUG >>> >> >>> org.apache.hadoop.hbase.regionserver.HRegion: >>> >> >>> > Started memstore flush for >>> >> >>> > >>> >> >>> >>> >> >>> usertable,user1662209069,1299502135191.9fa929e6fb439843cffb604dea3f88f6= ., >>> >> >>> > current region memstore size 68.6m >>> >> >>> > 2011-03-07 15:07:58,310 DEBUG >>> >> >>> org.apache.hadoop.hbase.regionserver.HRegion: >>> >> >>> > Flush requested on >>> >> >>> > >>> >> >>> usertable,user1601881548,1299502135191.f8efb9aa0922fa8a6a53fc49b8155ebc= . >>> >> >>> > -end of log file- >>> >> >>> > --- >>> >> >>> > >>> >> >>> >>> >> >>> Nothing more? >>> >> >>> >>> >> >>> >>> >> >> No, nothing after that. But quite a lot of logs before that, I ca= n >>> send >>> >> >> them if you'd like. >>> >> >> >>> >> >> >>> >> >> >>> >> >>> Thanks, >>> >> >>> St.Ack >>> >> >>> >>> >> >> >>> >> >> Thanks alot! >>> >> >> >>> >> >> >>> >> > >>> >> >>> > >>> >> >> > --=20 =E4=BB=8E=E6=88=91=E7=9A=84=E7=A7=BB=E5=8A=A8=E8=AE=BE=E5=A4=87=E5=8F=91=E9= =80=81 Thanks & Best regards jiajun