Return-Path: Delivered-To: apmail-hadoop-hbase-dev-archive@minotaur.apache.org Received: (qmail 47703 invoked from network); 26 Jan 2010 03:24:19 -0000 Received: from hermes.apache.org (HELO mail.apache.org) (140.211.11.3) by minotaur.apache.org with SMTP; 26 Jan 2010 03:24:19 -0000 Received: (qmail 41758 invoked by uid 500); 26 Jan 2010 03:24:18 -0000 Delivered-To: apmail-hadoop-hbase-dev-archive@hadoop.apache.org Received: (qmail 41710 invoked by uid 500); 26 Jan 2010 03:24:18 -0000 Mailing-List: contact hbase-dev-help@hadoop.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: hbase-dev@hadoop.apache.org Delivered-To: mailing list hbase-dev@hadoop.apache.org Received: (qmail 41700 invoked by uid 99); 26 Jan 2010 03:24:18 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 26 Jan 2010 03:24:18 +0000 X-ASF-Spam-Status: No, hits=-0.0 required=10.0 tests=SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (nike.apache.org: domain of jdcryans@gmail.com designates 209.85.210.194 as permitted sender) Received: from [209.85.210.194] (HELO mail-yx0-f194.google.com) (209.85.210.194) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 26 Jan 2010 03:24:10 +0000 Received: by yxe32 with SMTP id 32so8257002yxe.5 for ; Mon, 25 Jan 2010 19:23:49 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:mime-version:sender:received:in-reply-to :references:date:x-google-sender-auth:message-id:subject:from:to :content-type:content-transfer-encoding; bh=bVFQb38fr28J7WKTs5tkVwnsHo/WPlDL263enCa9pIs=; b=P/o0uvrulFUWKDF+4KhQHHub54bE4s6xZvrjVhQfJ4vUZS298xQ/ngZ6dEjkbyiGbs li8vWQBGfhL0LWwiKQucD12MrAvVdX9p5S4ROR/FYZkkfEzSKsreTuS6UCraOSKnTE2j y6aKN1FuC6SfyBTB2M5ISFGECkWHKxFUCfvUs= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=mime-version:sender:in-reply-to:references:date :x-google-sender-auth:message-id:subject:from:to:content-type :content-transfer-encoding; b=dRXIyBYRTHdjXBcSKjsd5NC7XMxEuyeTIrmXR63mi3S9uocBK1bEsz0UYNbZhwGGLW zLlqkbOXYgrHFN/PghWFj48CqcKPD8E2VleScmhe2bSTtJ1wn8x4271r2CeycHsZgryX Tk28ZB+qSBlzEnpe/qoFvFbrhFC1mtEyi/dBc= MIME-Version: 1.0 Sender: jdcryans@gmail.com Received: by 10.91.74.18 with SMTP id b18mr329558agl.39.1264476229096; Mon, 25 Jan 2010 19:23:49 -0800 (PST) In-Reply-To: <7BC93F71D08F104A907F273FC6FF60AA6CEC2CD9C5@SC-MBXC1.TheFacebook.com> References: <7BC93F71D08F104A907F273FC6FF60AA6CEC2CD9C5@SC-MBXC1.TheFacebook.com> Date: Mon, 25 Jan 2010 19:23:49 -0800 X-Google-Sender-Auth: be1f0a06d0dcefc6 Message-ID: <31a243e71001251923h5bdc0653p9ed4e5a6c5fc2280@mail.gmail.com> Subject: Re: scanner lease expired/region server shutdown From: Jean-Daniel Cryans To: hbase-dev@hadoop.apache.org Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable X-Virus-Checked: Checked by ClamAV on apache.org I've seen this before, this is a GC pause that took more than a minute. It was enough to time out the scanner leases and zookeeper session. J-D On Mon, Jan 25, 2010 at 7:14 PM, Kannan Muthukkaruppan wrote: > I was doing some brute force testing - running one instance of Performanc= eEvaluation (PE) for writes, and another instance for randomReads. > > One of the region servers went down after a while. [This is on 0.20.2]. T= he region server log had things like: > > 2010-01-25 11:33:39,416 WARN org.apache.hadoop.hbase.regionserver.HLog: I= PC Server handler 34 on 60020 took 65190ms appending an edit to hlog; editc= ount=3D27878 > 2010-01-25 11:33:39,416 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCa= che: Cache Stats: Sizes: Total=3D1815.4441MB (1903631104), Free=3D581.23096= MB (609464832),\ > =A0Max=3D2396.675MB (2513095936), Counts: Blocks=3D28216, Access=3D109046= 89, Hit=3D9313612, Miss=3D1591077, Evictions=3D110, Evicted=3D414336, Ratio= s: Hit Ratio=3D85.40924191474\ > 915%, Miss Ratio=3D14.590759575366974%, Evicted/Run=3D3766.69091796875 > 2010-01-25 11:33:39,417 WARN org.apache.hadoop.hbase.regionserver.HLog: I= PC Server handler 52 on 60020 took 65195ms appending an edit to hlog; editc= ount=3D27879 > 2010-01-25 11:33:39,417 WARN org.apache.hadoop.hbase.regionserver.HLog: I= PC Server handler 75 on 60020 took 65195ms appending an edit to hlog; editc= ount=3D27880 > 2010-01-25 11:33:39,417 INFO org.apache.hadoop.hbase.regionserver.HRegion= Server: Scanner -5220035164735014996 lease expired > > This was followed by: > 2010-01-25 11:33:39,636 WARN org.apache.zookeeper.ClientCnxn: Exception c= losing session 0x12652c620520001 to sun.nio.ch.SelectionKeyImpl@788ab708 > java.io.IOException: TIMED OUT > > And also errors like: > 2010-01-25 11:33:39,692 ERROR org.apache.hadoop.hbase.regionserver.HRegio= nServer: > org.apache.hadoop.hbase.UnknownScannerException: Name: -52200351647350149= 96 > > Eventually, the region server shutdown. > 2010-01-25 11:33:40,293 INFO org.apache.hadoop.ipc.HBaseServer: IPC Serve= r handler 85 on 60020, call get([B@2eec23d2, row=3D0007555231, maxVersions= =3D1, timeRange=3D[0,9223372036854775807), families=3D{(family=3Dinfo, colu= mns=3D{data}}) from 10.131.1.186:51645: error: java.io.IOException: Server = not running, aborting > > Looks like for some reason the region server got overloaded, and was not = able to finish the scan (I am assuming this is the period meta/root region = scan initiated by the master). > > Any thoughts/ideas on the above? > > Subsequently, I restarted the region server and just started the PE in se= quentialWrite mode. It seems to be off to a really slow start. For several = regions, it is printing the following: > > 10/01/25 18:12:44 DEBUG client.HConnectionManager$TableServers: locateReg= ionInMeta attempt 0 of 10 failed; retrying after sleep of 1000 because: reg= ion offline: TestTable,0031841132,1264121653303 > 10/01/25 18:12:44 DEBUG client.HConnectionManager$TableServers: Removed .= META.,,1 for tableName=3D.META. from cache because of TestTable,0032000000,= 99999999999999 > 10/01/25 18:12:44 DEBUG client.HConnectionManager$TableServers: Cached lo= cation for .META.,,1 is 10.129.68.212:60020 > 10/01/25 18:12:45 DEBUG client.HConnectionManager$TableServers: locateReg= ionInMeta attempt 1 of 10 failed; retrying after sleep of 1000 because: reg= ion offline: TestTable,0031841132,1264121653303 > 10/01/25 18:12:45 DEBUG client.HConnectionManager$TableServers: Removed .= META.,,1 for tableName=3D.META. from cache because of TestTable,0032000000,= 99999999999999 > [..and so on..] > 10/01/25 18:12:59 DEBUG client.HConnectionManager$TableServers: Cached lo= cation for .META.,,1 is 10.129.68.212:60020 > 10/01/25 18:13:07 DEBUG client.HConnectionManager$TableServers: locateReg= ionInMeta attempt 8 of 10 failed; retrying after sleep of 16000 because: re= gion offline: TestTable,0031841132,1264121653303 > 10/01/25 18:13:07 DEBUG client.HConnectionManager$TableServers: Removed .= META.,,1 for tableName=3D.META. from cache because of TestTable,0032000000,= 99999999999999 > > When I scan the '.META.', I noticed that most of the regions for which I = get the above error were in the middle of (?) a split. For example: > > TestTable,0031841132,126412 column=3Dinfo:regioninfo, timestamp=3D1264470= 099796, value=3DREGION =3D> {NAME =3D> 'TestTa > =A01653303 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 ble,0031841132,1264121= 653303', STARTKEY =3D> '0031841132', ENDKEY =3D> '0032057774', > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 ENCODED =3D> 6445= 00792, OFFLINE =3D> true, SPLIT =3D> true, TABLE =3D> {{NAME =3D> 'TestTab > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 le', FAMILIES =3D= > [{NAME =3D> 'info', VERSIONS =3D> '3', COMPRESSION =3D> 'NONE', TTL =3D> > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0'2147483647', = BLOCKSIZE =3D> '65536', IN_MEMORY =3D> 'false', BLOCKCACHE =3D> 'true'}] > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 }} > =A0TestTable,0031841132,126412 column=3Dinfo:server, timestamp=3D12641216= 55148, value=3D10.129.68.214:60020 > =A01653303 > =A0TestTable,0031841132,126412 column=3Dinfo:serverstartcode, timestamp= =3D1264121655148, value=3D1264109117245 > =A01653303 > =A0TestTable,0031841132,126412 column=3Dinfo:splitA, timestamp=3D12644700= 99796, value=3D\000\n0031978284\000\000\000\0 > =A01653303 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 01&g\016\357\277\275\0= 24\"TestTable,0031841132,1264449401108\000\n0031841132\000\ > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 000\000\005\tTest= Table\000\000\000\000\000\002\000\000\000\aIS_ROOT\000\000\000\0 > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 05false\000\000\0= 00\aIS_META\000\000\000\005false\000\000\000\001\a\004info\000\0 > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 00\000\a\000\000\= 000\vBLOOMFILTER\000\000\000\005false\000\000\000\vCOMPRESSION\0 > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 00\000\000\004NON= E\000\000\000\bVERSIONS\000\000\000\0013\000\000\000\003TTL\000\ > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 000\000\n21474836= 47\000\000\000\tBLOCKSIZE\000\000\000\00565536\000\000\000\tIN_M > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 EMORY\000\000\000= \005false\000\000\000\nBLOCKCACHE\000\000\000\004true\357\277\27 > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 5\337\274# > =A0TestTable,0031841132,126412 column=3Dinfo:splitB, timestamp=3D12644700= 99796, value=3D\000\n0032057774\000\000\000\0 > =A01653303 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 01&g\016\357\277\275\0= 24\"TestTable,0031978284,1264449401108\000\n0031978284\000\ > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 000\000\005\tTest= Table\000\000\000\000\000\002\000\000\000\aIS_ROOT\000\000\000\0 > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 05false\000\000\0= 00\aIS_META\000\000\000\005false\000\000\000\001\a\004info\000\0 > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 00\000\a\000\000\= 000\vBLOOMFILTER\000\000\000\005false\000\000\000\vCOMPRESSION\0 > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 00\000\000\004NON= E\000\000\000\bVERSIONS\000\000\000\0013\000\000\000\003TTL\000\ > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 000\000\n21474836= 47\000\000\000\tBLOCKSIZE\000\000\000\00565536\000\000\000\tIN_M > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 EMORY\000\000\000= \005false\000\000\000\nBLOCKCACHE\000\000\000\004truej\357\277\2 > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 75\312\262 > > Any thoughts/ideas on what might be going on? Appreciate any input in hel= ping understand what might be going on. > > Regards, > Kannan > > >