From: Dmitriy Lyubimov
To: user@hbase.apache.org
Subject: Re: Table region got stuck, doesn't move/assign
Date: Thu, 19 Jan 2012 14:15:44 -0800

Thank you, Michael.

The problem is solved (for now) by moving the region out after restarting
the region server, although we don't really know why it happened or what
happened to that region.

The region server got stuck on any request to one particular region, and
only that one. The master was OK, as I realized later. Why it couldn't
move the region immediately, I am not sure; but as soon as we restarted
the region server and switched the table offline/online, it was able to
complete the move/reassignment of the region.

The real problem was that it happened to one (apparently random) region
in a region server but not to others. The symptoms were the region server
hanging and never answering scan requests to that region (but answering
them for others). The condition persisted for a long time (several days),
and we did not figure it out until we caught several low-importance jobs
timing out while reading from the table containing that region. The table
experiences asynchronous reads and regular write updates (it's actually a
part of HBL cube).

I think there's a really low chance we'll ever get to the bottom of it,
so we dropped any further triage attempts at this point. I guess we also
just need to upgrade our HBase stack in prod.

Thank you very much, sir.
-d

On Wed, Jan 18, 2012 at 9:34 AM, Stack wrote:
> On Mon, Jan 16, 2012 at 3:45 PM, Dmitriy Lyubimov wrote:
>> i have a table which seems to get stuck in a state where it can't be
>> queried, moved or split/compacted.
>>
>
> How many regions in this table?  One only?
>
>> The logs don't have any error statements. Our admin tried hbck to no avail.
>>
>
> What did your admin see?
>
>
>> We stopped the region server, table did not get reassigned. (all other
>> did).
>> when browsed in UI, this table just showed "region server
>> offline". (??? shouldn't get reassigned as others did?)
>>
>
> Yes.  It should.
>
>> Bringing region server online loaded it with other regions, but not
>> that table. master apparently still thinks it is on that node (data6)
>> and so all requests are failing with region not serving message.
>>
>
>
> So, there is something 'wrong' w/ that table.  Can you track it in
> master log and see what happens when master tries to assign it?  Maybe
> it's failing to open?
>
>> assign/move/unassign commands have no effect (move fails, but
>> assign/unassign seems to be quiet with no apparent effect).
>>
>> Another weirdness: it's the only table that is showing up under
>> hbase/table in zk and its region is listed under /hbase/unassigned.
>>
>
>
> Maybe it's stuck in transition?  You should see messages in master log
> if this is the case.
>
>> Where can i read about meaning and transitions of zookeeper nodes under /hbase ?
>>
>
> I don't think this is documented in the reference guide (it's a little too
> much detail for most I'd say).  Best place to look is probably source
> code.  See here for an entrance into the wonderful world of
> master/regionserver state transitions:
> http://hbase.apache.org/xref/org/apache/hadoop/hbase/executor/EventHandler.html#93
>
> St.Ack