Return-Path: X-Original-To: apmail-hbase-user-archive@www.apache.org Delivered-To: apmail-hbase-user-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 6AE906010 for ; Tue, 28 Jun 2011 21:15:26 +0000 (UTC) Received: (qmail 86900 invoked by uid 500); 28 Jun 2011 21:15:25 -0000 Delivered-To: apmail-hbase-user-archive@hbase.apache.org Received: (qmail 86787 invoked by uid 500); 28 Jun 2011 21:15:24 -0000 Mailing-List: contact user-help@hbase.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@hbase.apache.org Delivered-To: mailing list user@hbase.apache.org Received: (qmail 86778 invoked by uid 99); 28 Jun 2011 21:15:23 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 28 Jun 2011 21:15:23 +0000 X-ASF-Spam-Status: No, hits=-0.7 required=5.0 tests=FREEMAIL_FROM,RCVD_IN_DNSWL_LOW,SPF_PASS,T_TO_NO_BRKTS_FREEMAIL X-Spam-Check-By: apache.org Received-SPF: pass (nike.apache.org: domain of jdcryans@gmail.com designates 74.125.83.41 as permitted sender) Received: from [74.125.83.41] (HELO mail-gw0-f41.google.com) (74.125.83.41) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 28 Jun 2011 21:15:17 +0000 Received: by gwaa12 with SMTP id a12so318199gwa.14 for ; Tue, 28 Jun 2011 14:14:56 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=mime-version:sender:in-reply-to:references:date :x-google-sender-auth:message-id:subject:from:to:content-type :content-transfer-encoding; bh=e2XTw2br4Ghuwc0gNKlbC8sC/DzHwsOoYTmE/r26F7M=; b=P14bRWbos1s6+2kr2YII7D+dXD/dlr7gPXw/lm0XCYQlqw+Te5MUiTnPiI/Wb0riB7 UTZh8mbo0dP6+dNgFYBjWzCNswhgcCwmfUPhxjwNCwH+6hVuwQn3YbFqadKX2vrWLydF N2M4lYmrnUChfQkVXXzQjIb1C8vW0qZ/tWAf0= MIME-Version: 1.0 Received: by 10.101.205.36 with SMTP id h36mr21936anq.74.1309295696328; Tue, 28 Jun 2011 14:14:56 -0700 (PDT) Sender: jdcryans@gmail.com Received: by 10.100.226.14 with HTTP; Tue, 28 Jun 2011 14:14:56 -0700 (PDT) In-Reply-To: References: Date: Tue, 28 Jun 2011 14:14:56 -0700 X-Google-Sender-Auth: QsYFUUhiK9yP9x-D5IUbKZ21zQc Message-ID: Subject: Re: Master died after failed assignment of regionserver From: Jean-Daniel Cryans To: user@hbase.apache.org Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable X-Virus-Checked: Checked by ClamAV on apache.org Looks like I'll be fixing it in the context of HBASE-3984. J-D On Mon, Jun 27, 2011 at 10:21 AM, Jean-Daniel Cryans wrote: > Yeah the BulkAssigner uses a AssignmentManager.assign method that > won't retry if it gets an exception because originally it was only > used during startup (eg if you can't assign the regions properly at > startup then kill yourself). > > In this case it happened to you when creating a table (it also uses > the BulkAssigner now). To me it's a bug, it should just be assigned > somewhere else. Mind filing a jira? > > As to how you got a 60 secs timeout, you should take a look at the RS > log (and next time it happens also do a jstack). > > Thx, > > J-D > > On Mon, Jun 27, 2011 at 9:54 AM, Matt Davies wrote= : >> Hey all, >> >> Interesting problem I'm running into. =A0It appears that our master has = died >> after a failed region assignment. =A0Any ideas as to why? >> >> >> We are running version 0.90.1 in this cluster. >> H3M1 =3D master >> H3S* =3D regionservers >> >> out on H3S3 the regionserver process is running (but didn't respond per = the >> log) >> >> 2011-06-27 10:18:22,517 INFO >> org.apache.hadoop.hbase.master.handler.DisableTableHandler: Attemping to >> disable table ContentGuidance >> 2011-06-27 10:18:22,522 INFO >> org.apache.hadoop.hbase.master.handler.DisableTableHandler: Offlining 1 >> regions. >> 2011-06-27 10:18:23,527 INFO >> org.apache.hadoop.hbase.master.handler.DisableTableHandler: Disabled tab= le >> is done=3Dtrue >> 2011-06-27 10:18:24,536 INFO >> org.apache.hadoop.hbase.master.handler.TableEventHandler: Handling table >> operation C_M_DELETE_TABLE on table ContentGuidance >> 2011-06-27 10:18:24,584 INFO org.apache.hadoop.hbase.catalog.MetaEditor: >> Deleted region >> ContentGuidance,,1298676132327.9d63b6173c7307c27aa9728f738f0ef3. from ME= TA >> 2011-06-27 10:19:17,296 WARN org.apache.hadoop.hbase.zookeeper.ZKTable: >> Moving table ContentGuidance state to enabled but was already enabled >> 2011-06-27 10:19:17,313 INFO org.apache.hadoop.hbase.regionserver.wal.HL= og: >> HLog configuration: blocksize=3D64 MB, rollsize=3D60.8 MB, enabled=3Dtru= e, >> flushlogentries=3D1, optionallogflushinternal=3D1000ms >> 2011-06-27 10:19:17,349 INFO >> org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter: Using sy= ncFs >> -- HDFS-200 >> 2011-06-27 10:19:17,350 INFO org.apache.hadoop.hbase.regionserver.wal.HL= og: >> New hlog >> /hbase/ContentGuidance/f38bbc9bc672b5ea415b6dcc7e1da43a/.logs/hlog.13091= 91557313 >> 2011-06-27 10:19:17,350 INFO org.apache.hadoop.hbase.regionserver.wal.HL= og: >> Using getNumCurrentReplicas--HDFS-826 >> 2011-06-27 10:19:17,433 INFO org.apache.hadoop.hbase.regionserver.HRegio= n: >> Onlined ContentGuidance,,1309191557292.f38bbc9bc672b5ea415b6dcc7e1da43a.= ; >> next sequenceid=3D1 >> 2011-06-27 10:19:17,438 INFO org.apache.hadoop.hbase.catalog.MetaEditor: >> Added region >> ContentGuidance,,1309191557292.f38bbc9bc672b5ea415b6dcc7e1da43a. to META >> 2011-06-27 10:19:17,439 INFO org.apache.hadoop.hbase.regionserver.HRegio= n: >> Closed ContentGuidance,,1309191557292.f38bbc9bc672b5ea415b6dcc7e1da43a. >> 2011-06-27 10:19:17,439 INFO org.apache.hadoop.hbase.regionserver.wal.HL= og: >> IPC Server handler 87 on 60000.logSyncer exiting >> 2011-06-27 10:19:17,470 INFO >> org.apache.hadoop.hbase.master.AssignmentManager: Bulk assigning 1 regio= n(s) >> round-robin across 4 server(s) >> 2011-06-27 10:19:17,472 INFO >> org.apache.hadoop.hbase.master.AssignmentManager: Bulk assigning done >> 2011-06-27 10:19:17,480 INFO >> org.apache.hadoop.hbase.master.AssignmentManager: H3S4,60020,13089466577= 03 >> unassigned znodes=3D1 of total=3D1 >> 2011-06-27 10:19:21,530 INFO >> org.apache.hadoop.hbase.master.handler.DisableTableHandler: Attemping to >> disable table ContentGuidance >> 2011-06-27 10:19:21,536 INFO >> org.apache.hadoop.hbase.master.handler.DisableTableHandler: Offlining 1 >> regions. >> 2011-06-27 10:19:22,543 INFO >> org.apache.hadoop.hbase.master.handler.DisableTableHandler: Disabled tab= le >> is done=3Dtrue >> 2011-06-27 10:19:23,544 INFO >> org.apache.hadoop.hbase.master.handler.TableEventHandler: Handling table >> operation C_M_DELETE_TABLE on table ContentGuidance >> 2011-06-27 10:19:23,551 INFO org.apache.hadoop.hbase.catalog.MetaEditor: >> Deleted region >> ContentGuidance,,1309191557292.f38bbc9bc672b5ea415b6dcc7e1da43a. from ME= TA >> 2011-06-27 10:20:17,479 FATAL org.apache.hadoop.hbase.master.HMaster: Fa= iled >> assignment of regions to serverName=3DH3S3,60020,1308946657608, >> load=3D(requests=3D0, regions=3D7, usedHeap=3D976, maxHeap=3D1011) >> java.net.SocketTimeoutException: Call to H3S3/10..x.x.x3:60020 failed on >> socket timeout exception: java.net.SocketTimeoutException: 60000 millis >> timeout while waiting for channel to be ready for read. ch : >> java.nio.channels.SocketChannel[connected local=3D/10..x.x.x:34904 >> remote=3DH3S3/10.x.x.x:60020] >> at >> org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:8= 02) >> at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:775) >> at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257= ) >> at $Proxy7.openRegions(Unknown Source) >> at >> org.apache.hadoop.hbase.master.ServerManager.sendRegionOpen(ServerManage= r.java:566) >> at >> org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManage= r.java:776) >> at >> org.apache.hadoop.hbase.master.AssignmentManager$SingleServerBulkAssigne= r.run(AssignmentManager.java:1310) >> at >> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecuto= r.java:886) >> at >> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.ja= va:908) >> at java.lang.Thread.run(Thread.java:619) >> Caused by: java.net.SocketTimeoutException: 60000 millis timeout while >> waiting for channel to be ready for read. ch : >> java.nio.channels.SocketChannel[connected local=3D/10..x.x.x:34904 >> remote=3DH3S3/10..x.x.x:60020] >> at >> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:= 164) >> at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:1= 55) >> at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:1= 28) >> at java.io.FilterInputStream.read(FilterInputStream.java:116) >> at >> org.apache.hadoop.hbase.ipc.HBaseClient$Connection$PingInputStream.read(= HBaseClient.java:299) >> at java.io.BufferedInputStream.fill(BufferedInputStream.java:218) >> at java.io.BufferedInputStream.read(BufferedInputStream.java:237) >> at java.io.DataInputStream.readInt(DataInputStream.java:370) >> at >> org.apache.hadoop.hbase.ipc.HBaseClient$Connection.receiveResponse(HBase= Client.java:539) >> at >> org.apache.hadoop.hbase.ipc.HBaseClient$Connection.run(HBaseClient.java:= 477) >> 2011-06-27 10:20:17,482 INFO org.apache.hadoop.hbase.master.HMaster: >> Aborting >> 2011-06-27 10:20:18,390 INFO org.apache.hadoop.ipc.HBaseServer: Stopping >> server on 60000 >> 2011-06-27 10:20:18,390 INFO org.apache.hadoop.hbase.master.CatalogJanit= or: >> H3M1:60000-CatalogJanitor exiting >> 2011-06-27 10:20:18,390 INFO org.apache.hadoop.hbase.master.HMaster$1: >> H3M1:60000-BalancerChore exiting >> 2011-06-27 10:20:18,392 INFO org.apache.hadoop.ipc.HBaseServer: Stopping= IPC >> Server listener on 60000 >> >