Date: Thu, 7
Jul 2011 03:41:26 -0700
Subject: double assignment WAS: Errors after major compaction
From: Ted Yu
To: user@hbase.apache.org

>> Mind pastebin'ing this part of master log?

2011-06-29 16:39:54,326 DEBUG org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Opened region gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. on hadoop1-s05.farm-ny.gigya.com,60020,1307349217076
2011-06-29 16:40:00,598 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:60000-0x13004a31d7804c4 Creating (or updating) unassigned node for 584dac5cc70d8682f71c4675a843c309 with OFFLINE state

Eran:
Were there more log lines between these two lines in the master log?
TimeoutMonitor.chore() should have logged something if it caused the region re-assignment.

Thanks

On Wed, Jul 6, 2011 at 10:52 PM, Stack wrote:

> On Sun, Jul 3, 2011 at 12:02 PM, Eran Kutner wrote:
> > 4. Then at 16:40:00 the master log says: master:60000-0x13004a31d7804c4
> > Creating (or updating) unassigned node for 584dac5cc70d8682f71c4675a843c309
> > with OFFLINE state - why did it decide to take the region offline after
> > learning it was successfully opened?
>
> My guess is that though we'd opened the region, the timeout of regions
> in transition expired and we queued assigning it elsewhere (the
> first step in assigning a region elsewhere is putting the region's
> znode into the OFFLINE state). Mind pastebin'ing this part of master
> log?
>
> The issues Ted cites and the fix-raciness issue I added to it are
> about cutting down the span over which locks are held in the master --
> this has made for big improvements in the promptness with which the
> master processes state transitions -- and then there are races between
> the handling of region transitions -- e.g. opens -- down in the region
> transition handlers and the running of the timeout monitor. These are
> what's being addressed.
>
> > 5. Then it tries to reopen the region on hadoop1-s05, which indicates in its
> > log that the open request failed because the region was already open - why
> > didn't the master use that information to learn that the region was already
> > open?
>
> It looks like we log it as WARN on the regionserver side but do
> nothing else with it. Here is the message:
>
> 2011-06-29 16:40:01,079 WARN org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Attempted open of gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. but already online on this server
>
> We notice we already have it opened down in the open region handler
> in the regionserver. We've let go of the connection to the
> master at this stage, so there is no way of our flagging the master that we
> already have this region. What we should do is, before we queue it,
> check if we already have it and return the master an
> AlreadyOpenException (I made HBASE-4073 to make sure we don't forget
> about this one -- the root issue needs addressing but thereafter, we
> should never queue the opening of a region we already have opened on
> the regionserver).
>
> > 7. Now the master forces the transition of the region to hadoop1-s02 but
> > there is no sign of that on hadoop1-s05 - why doesn't the old RS
> > (hadoop1-s05) detect that it is no longer the master and relinquish
> > control of the region?
>
> Well, the master doesn't know that s05 has the region open -- that's
> why it gives it to s02 -- and then, there is no channel available to
> s05 to figure out who has what.
>
> St.Ack
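
The check-before-queue idea proposed above for HBASE-4073 can be sketched roughly as follows. This is a toy model, not HBase code: `ToyRegionServer`, `requestOpen`, `processQueue`, `queuedOpens` and `isOnline` are hypothetical names made up for illustration, and `AlreadyOpenException` here is a stand-in class that only borrows the name mentioned in the thread. The point it tries to show is that the "already online" check runs while the master's request is still being answered, so the rejection reaches the caller instead of ending up only as a WARN on the regionserver.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

// Hypothetical stand-in for the exception proposed in the thread; not HBase's class.
class AlreadyOpenException extends Exception {
    AlreadyOpenException(String regionName) {
        super("Region already online on this server: " + regionName);
    }
}

// Toy model of a regionserver; none of these methods are real HBase APIs.
class ToyRegionServer {
    private final Set<String> onlineRegions = new HashSet<>();
    private final Queue<String> openQueue = new ArrayDeque<>();

    // Runs in the context of the master's request, so throwing here
    // tells the master directly that the region is already open.
    synchronized void requestOpen(String regionName) throws AlreadyOpenException {
        if (onlineRegions.contains(regionName)) {
            throw new AlreadyOpenException(regionName);
        }
        openQueue.add(regionName);
    }

    // Simulates the open handler draining the queue and bringing regions online.
    synchronized void processQueue() {
        String regionName;
        while ((regionName = openQueue.poll()) != null) {
            onlineRegions.add(regionName);
        }
    }

    synchronized int queuedOpens() {
        return openQueue.size();
    }

    synchronized boolean isOnline(String regionName) {
        return onlineRegions.contains(regionName);
    }
}
```

In this sketch a second open request for a region that is already online is rejected before anything is queued. It does not cover the root race between the timeout monitor and the transition handlers on the master side, nor a duplicate request that arrives while the first one is still sitting in the queue.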