Return-Path: Delivered-To: apmail-lucene-hadoop-dev-archive@locus.apache.org Received: (qmail 31446 invoked from network); 28 Nov 2007 22:25:14 -0000 Received: from hermes.apache.org (HELO mail.apache.org) (140.211.11.2) by minotaur.apache.org with SMTP; 28 Nov 2007 22:25:13 -0000 Received: (qmail 87037 invoked by uid 500); 28 Nov 2007 22:25:00 -0000 Delivered-To: apmail-lucene-hadoop-dev-archive@lucene.apache.org Received: (qmail 86993 invoked by uid 500); 28 Nov 2007 22:25:00 -0000 Mailing-List: contact hadoop-dev-help@lucene.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: hadoop-dev@lucene.apache.org Delivered-To: mailing list hadoop-dev@lucene.apache.org Received: (qmail 86962 invoked by uid 99); 28 Nov 2007 22:24:59 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 28 Nov 2007 14:24:59 -0800 X-ASF-Spam-Status: No, hits=-100.0 required=10.0 tests=ALL_TRUSTED X-Spam-Check-By: apache.org Received: from [140.211.11.4] (HELO brutus.apache.org) (140.211.11.4) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 28 Nov 2007 22:24:47 +0000 Received: from brutus (localhost [127.0.0.1]) by brutus.apache.org (Postfix) with ESMTP id 8B82E714240 for ; Wed, 28 Nov 2007 14:24:43 -0800 (PST) Message-ID: <13933257.1196288683568.JavaMail.jira@brutus> Date: Wed, 28 Nov 2007 14:24:43 -0800 (PST) From: "Jim Kellerman (JIRA)" To: hadoop-dev@lucene.apache.org Subject: [jira] Commented: (HADOOP-2295) [hbase] Still multi-assigning regions In-Reply-To: <11993953.1196206123375.JavaMail.jira@brutus> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-Virus-Checked: Checked by ClamAV on apache.org [ https://issues.apache.org/jira/browse/HADOOP-2295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546440 ] Jim Kellerman commented on HADOOP-2295: --------------------------------------- When the master assigned a region twice to the same server, what happened was that on the second assignment, the region server returned two 'region is opened' messages in the same message to the master (code bug in HRegionServer). The first message was processed properly and the master queued a task to update the meta. The master then looped to process the second message, and now the region was no longer in the 'unassigned regions' list so the master told the region server to close it. In the mean time, the update of the meta completes, clients start trying to access the region but the region server has closed the region so the client gets a NotServingRegion exception. The master was doing the 'right thing' by reassigning the region because the region server took longer than 'hbase.hbasemaster.maxregionopen' (30 seconds) to open the region. Fixing the aforementioned bug will prevent this from occurring, but if the master should assign the same region to two different region servers, this could easily result in corrupting the region beyond repair as there is a lot of file i/o associated with bringing a region on-line. To avoid this, we propose that a region server acknowledge that it is opening the region, and when the master receives the ack, it extends the region reassignment timeout. If the region server subsequently fails to open the region, the master can reassign the region after the additional grace period and we can be pretty certain that there will be no interference between multiple region servers opening the same region. > [hbase] Still multi-assigning regions > ------------------------------------- > > Key: HADOOP-2295 > URL: https://issues.apache.org/jira/browse/HADOOP-2295 > Project: Hadoop > Issue Type: Bug > Components: contrib/hbase > Reporter: stack > Assignee: Jim Kellerman > > Start key is img216/1725/22236401lvh6.gif. We doubly assign. In between assignments, master rules 'should not have opened region' and sends a shutdown region to remote server -- which it obeys. Meantime, master thinks regions are deployed. > Here is master log: > {code} > ... > 2007-11-27 04:26:21,818 INFO hbase.HMaster - region postlog,img216/1725/22236401lvh6.gif,1196164441230 split. New regions are: postlog,img216/1725/22236401lvh6.gif,1196166373235, postlog,img216/3961/aasaxonlivene6.jpg,1196166373241 > 2007-11-27 04:26:23,261 INFO hbase.HMaster - assigning region postlog,img216/1725/22236401lvh6.gif,1196166373235 to server XX.XX.XX.33:60020 > 2007-11-27 04:26:27,462 DEBUG hbase.HMaster - HMaster.metaScanner scanner: -400573094926830852 regioninfo: {regionname: postlog,img216/1725/22236401lvh6.gif,1196164441230, startKey: , offline: true, split: true, tableDesc: {name: postlog, families: {cookie:={name: cookie, max versions: 1, compression: NONE, in memory: false, max length: 2147483647, bloom filter: none}, ip:={name: ip, max versions: 1, compression: NONE, in memory: false, max length: 2147483647, bloom filter: none}}}}, server: XX.XX.XX.25:60020, startCode: 1196155489036 > 2007-11-27 04:26:27,464 DEBUG hbase.HMaster - HMaster.metaScanner scanner: -400573094926830852 regioninfo: {regionname: postlog,img216/1725/22236401lvh6.gif,1196166373235, startKey: , tableDesc: {name: postlog, families: {cookie:={name: cookie, max versions: 1, compression: NONE, in memory: false, max length: 2147483647, bloom filter: none}, ip:={name: ip, max versions: 1, compression: NONE, in memory: false, max length: 2147483647, bloom filter: none}}}}, server: , startCode: -1 > 2007-11-27 04:26:27,464 DEBUG hbase.HMaster - Checking postlog,img216/1725/22236401lvh6.gif,1196166373235 is assigned > 2007-11-27 04:26:27,671 DEBUG hbase.HMaster - Checked postlog,img216/1725/22236401lvh6.gif,1196164441230 for references: splitA: true, splitB: true > 2007-11-27 04:26:53,328 INFO hbase.HMaster - assigning region postlog,img216/1725/22236401lvh6.gif,1196166373235 to server XX.XX.XX.33:60020 > 2007-11-27 04:27:05,364 INFO hbase.HMaster - XX.XX.XX.33:60020 serving postlog,img216/1725/22236401lvh6.gif,1196166373235 > 2007-11-27 04:27:05,364 DEBUG hbase.HMaster - region server XX.XX.XX.33:60020 should not have opened region postlog,img216/1725/22236401lvh6.gif,1196166373235 > 2007-11-27 04:27:05,367 INFO hbase.HMaster - regionname: postlog,img216/1725/22236401lvh6.gif,1196166373235, startKey: , tableDesc: {name: postlog, families: {cookie:={name: cookie, max versions: 1, compression: NONE, in memory: false, max length: 2147483647, bloom filter: none}, ip:={name: ip, max versions: 1, compression: NONE, in memory: false, max length: 2147483647, bloom filter: none}}} open on XX.XX.XX.33:60020 > 2007-11-27 04:27:05,368 INFO hbase.HMaster - updating row postlog,img216/1725/22236401lvh6.gif,1196166373235 in table .META.,,1 with startcode 1196155493271 and server XX.XX.XX.33:60020 > 2007-11-27 04:27:26,380 DEBUG hbase.HMaster - HMaster.metaScanner scanner: -4876467680083821318 regioninfo: {regionname: postlog,img216/1725/22236401lvh6.gif,1196164441230, startKey: , offline: true, split: true, tableDesc: {name: postlog, families: {cookie:={name: cookie, max versions: 1, compression: NONE, in memory: false, max length: 2147483647, bloom filter: none}, ip:={name: ip, max versions: 1, compression: NONE, in memory: false, max length: 2147483647, bloom filter: none}}}}, server: XX.XX.XX.25:60020, startCode: 1196155489036 > 2007-11-27 04:27:26,381 DEBUG hbase.HMaster - HMaster.metaScanner scanner: -4876467680083821318 regioninfo: {regionname: postlog,img216/1725/22236401lvh6.gif,1196166373235, startKey: , tableDesc: {name: postlog, families: {cookie:={name: cookie, max versions: 1, compression: NONE, in memory: false, max length: 2147483647, bloom filter: none}, ip:={name: ip, max versions: 1, compression: NONE, in memory: false, max length: 2147483647, bloom filter: none}}}}, server: XX.XX.XX.33:60020, startCode: 1196155493271 > 2007-11-27 04:27:26,381 DEBUG hbase.HMaster - Checking postlog,img216/1725/22236401lvh6.gif,1196166373235 is assigned > 2007-11-27 04:27:26,511 DEBUG hbase.HMaster - Checked postlog,img216/1725/22236401lvh6.gif,1196164441230 for references: splitA: true, splitB: true > 2007-11-27 04:28:26,444 DEBUG hbase.HMaster - HMaster.metaScanner scanner: -7699450087241029374 regioninfo: {regionname: postlog,img216/1725/22236401lvh6.gif,1196164441230, startKey: , offline: true, split: true, tableDesc: {name: postlog, families: {cookie:={name: cookie, max versions: 1, compression: NONE, in memory: false, max length: 2147483647, bloom filter: none}, ip:={name: ip, max versions: 1, compression: NONE, in memory: false, max length: 2147483647, bloom filter: none}}}}, server: XX.XX.XX.25:60020, startCode: 1196155489036 > 2007-11-27 04:28:26,446 DEBUG hbase.HMaster - HMaster.metaScanner scanner: -7699450087241029374 regioninfo: {regionname: postlog,img216/1725/22236401lvh6.gif,1196166373235, startKey: , tableDesc: {name: postlog, families: {cookie:={name: cookie, max versions: 1, compression: NONE, in memory: false, max length: 2147483647, bloom filter: none}, ip:={name: ip, max versions: 1, compression: NONE, in memory: false, max length: 2147483647, bloom filter: none}}}}, server: XX.XX.XX.33:60020, startCode: 1196155493271 > 2007-11-27 04:28:26,446 DEBUG hbase.HMaster - Checking postlog,img216/1725/22236401lvh6.gif,1196166373235 is assigned > 2007-11-27 04:28:26,590 DEBUG hbase.HMaster - Checked postlog,img216/1725/22236401lvh6.gif,1196164441230 for references: splitA: true, splitB: true > 2007-11-27 04:29:26,459 DEBUG hbase.HMaster - HMaster.metaScanner scanner: 5846860608957952540 regioninfo: {regionname: postlog,img216/1725/22236401lvh6.gif,1196164441230, startKey: , offline: true, split: true, tableDesc: {name: po: 1, compression: NONE, in memory: false, max length: 21474836s: 1, compression: NONE, in memory: false, max length: 21474836er: none}}}}, server: XX.XX.XX.25:60020, startCode: 11961554890ase.HMaster - HMaster.metaScanner scanner: 5846860608957952540 /1725/22236401lvh6.gif,1196166373235, startKey: 2007-11-27 04:29:26,461 DEBUG hbase.HMas235 is assigned > 2007-11-27 04:29:26,622 DEBUG hbase.HMaster - Cces: splitA: true, splitB: true > ... > {code} > Here is regionserver log on node XX.XX.XX.33: > {code} > ... > 2007-11-27 04:26:29,318 INFO hbase.HRegionServer - MSG_REGION_OPEN : regionname: postlog,img216/1725/22236401lvh6.gif,1196166373235, startKey: , tableDesc: {name: postlog, families: {cookie:={name: cookie, max versions: 1, compression: NONE, in memory: false, max length: 2147483647, bloom filter: none}, ip:={name: ip, max versions: 1, compression: NONE, in memory: false, max length: 2147483647, bloom filter: none}}} > 2007-11-27 04:26:29,325 DEBUG hbase.HStore - starting 51396220/cookie (no reconstruction log) > 2007-11-27 04:26:30,420 DEBUG hbase.HRegionServer - flushing region postlog,img100/941/brandyfn6.gif,1196158363481 > 2007-11-27 04:26:30,420 DEBUG hbase.HRegion - Not flushing cache for region postlog,img100/941/brandyfn6.gif,1196158363481: snapshotMemcaches() determined that there was nothing to do > 2007-11-27 04:26:31,752 DEBUG hbase.HRegionServer - flushing region postlog,img149/2019/liyanaum6.gif,1196164996321 > 2007-11-27 04:26:31,753 DEBUG hbase.HRegion - Not flushing cache for region postlog,img149/2019/liyanaum6.gif,1196164996321: snapshotMemcaches() determined that there was nothing to do > 2007-11-27 04:26:31,753 DEBUG hbase.HRegionServer - flushing region postlog,img148/6048/p9290339kb0.jpg,1196162613173 > 2007-11-27 04:26:31,753 DEBUG hbase.HRegion - Not flushing cache for region postlog,img148/6048/p9290339kb0.jpg,1196162613173: snapshotMemcaches() determined that there was nothing to do > 2007-11-27 04:26:31,754 DEBUG hbase.HRegionServer - flushing region postlog,img132/5512/timeline8bx.png,1196159744555 > 2007-11-27 04:26:31,754 DEBUG hbase.HRegion - Not flushing cache for region postlog,img132/5512/timeline8bx.png,1196159744555: snapshotMemcaches() determined that there was nothing to do > 2007-11-27 04:26:37,915 DEBUG hbase.HRegionServer - flushing region postlog,img149/3724/whosherne4.jpg,1196164996322 > 2007-11-27 04:26:37,915 DEBUG hbase.HRegion - Not flushing cache for region postlog,img149/3724/whosherne4.jpg,1196164996322: snapshotMemcaches() determined that there was nothing to do > 2007-11-27 04:26:57,059 DEBUG hbase.HRegionServer - flushing region postlog,img148/8339/361de8578d356flip0hd.png,1196165445439 > 2007-11-27 04:26:57,060 DEBUG hbase.HRegion - Not flushing cache for region postlog,img148/8339/361de8578d356flip0hd.png,1196165445439: snapshotMemcaches() determined that there was nothing to do > 2007-11-27 04:27:03,604 DEBUG hbase.HStore - maximum sequence id for hstore 51396220/cookie is 30625781 > 2007-11-27 04:27:03,761 DEBUG hbase.HStore - starting 51396220/ip (no reconstruction log) > 2007-11-27 04:27:03,859 DEBUG hbase.HStore - maximum sequence id for hstore 51396220/ip is 30625781 > 2007-11-27 04:27:03,965 DEBUG hbase.HRegion - Next sequence id for region postlog,img216/1725/22236401lvh6.gif,1196166373235 is 30625782 > 2007-11-27 04:27:03,969 INFO hbase.HRegion - region postlog,img216/1725/22236401lvh6.gif,1196166373235 available > 2007-11-27 04:27:03,969 INFO hbase.HRegionServer - MSG_REGION_OPEN : regionname: postlog,img216/1725/22236401lvh6.gif,1196166373235, startKey: , tableDesc: {name: postlog, families: {cookie:={name: cookie, max versions: 1, compression: NONE, in memory: false, max length: 2147483647, bloom filter: none}, ip:={name: ip, max versions: 1, compression: NONE, in memory: false, max length: 2147483647, bloom filter: none}}} > 2007-11-27 04:27:04,578 INFO hbase.HRegionServer - MSG_REGION_CLOSE_WITHOUT_REPORT : regionname: postlog,img216/1725/22236401lvh6.gif,1196166373235, startKey: , tableDesc: {name: postlog, families: {cookie:={name: cookie, max versions: 1, compression: NONE, in memory: false, max length: 2147483647, bloom filter: none}, ip:={name: ip, max versions: 1, compression: NONE, in memory: false, max length: 2147483647, bloom filter: none}}} > 2007-11-27 04:27:04,578 DEBUG hbase.HRegion - Not flushing cache for region postlog,img216/1725/22236401lvh6.gif,1196166373235: snapshotMemcaches() determined that there was nothing to do > 2007-11-27 04:27:04,579 DEBUG hbase.HStore - closed 51396220/ip > 2007-11-27 04:27:04,580 DEBUG hbase.HStore - closed 51396220/cookie > 2007-11-27 04:27:04,580 INFO hbase.HRegion - closed postlog,img216/1725/22236401lvh6.gif,1196166373235 > .. > {code} > It looks like the assigned region hasn't been cleared from the unassigned regions list when the report comes in. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.