From: jeff saremi
To: Yu Li
Cc: dev@hbase.apache.org, hbase-user
Subject: Re: What is Dead Region Servers and how to clear them up?
Date: Sun, 28 May 2017 19:48:27 +0000
Yes Yu. What you're suggesting would work for us too and would still be appreciated.

thanks a lot
jeff

________________________________
From: Yu Li
Sent: Sunday, May 28, 2017 10:13:38 AM
To: jeff saremi
Cc: dev@hbase.apache.org; hbase-user
Subject: Re: What is Dead Region Servers and how to clear them up?

Thanks for the additional information Jeff, interesting scenario.

Let me re-explain: a dead server means that on this node (or container, in your case) there was once a regionserver process, but there no longer is. This doesn't indicate the current health state of the cluster; it only records the fact and alerts the operator to check those nodes/containers for whatever problem caused them to die. But I admit this might cause confusion.

And as I proposed in my previous mail, I think in the Yarn/Mesos deployment scenario we need to supply a command to clear those dead servers.
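Yu Li's explanation above — a dead entry records that a regionserver process once ran somewhere, and a restarted regionserver reporting in from the same host and port clears it — can be sketched as pure logic. This is an illustrative model only, not the actual HBase master code; the names `ServerName`, `DeadServerList`, and the `(host, port, startcode)` identity are assumptions made for the example:

```python
# Simplified model of the master's dead-server bookkeeping as described
# in this thread. Illustrative only -- not the real HBase classes.
from typing import NamedTuple, Set

class ServerName(NamedTuple):
    host: str
    port: int
    startcode: int  # process start timestamp; differs on every restart

class DeadServerList:
    def __init__(self) -> None:
        self.dead: Set[ServerName] = set()

    def server_expired(self, sn: ServerName) -> None:
        """ZK session lapsed: the process on that node/container is gone."""
        self.dead.add(sn)

    def server_reported_in(self, sn: ServerName) -> None:
        """A regionserver registered with the master. Drop any older dead
        entry for the same host:port (i.e., a restart on the same node)."""
        self.dead = {d for d in self.dead
                     if not (d.host == sn.host and d.port == sn.port
                             and d.startcode < sn.startcode)}

old = ServerName("node-a", 16020, 1000)
lst = DeadServerList()
lst.server_expired(old)
lst.server_reported_in(ServerName("node-b", 16020, 2000))  # different host
assert old in lst.dead          # nothing matched node-a: entry lingers
lst.server_reported_in(ServerName("node-a", 16020, 3000))  # same host:port
assert old not in lst.dead      # cleared by the restarted instance
```

This is exactly the Yarn/Mesos problem discussed in the thread: the replacement container usually lands on a different host, so the clearing condition never fires and dead entries accumulate — hence the proposal for an explicit clear command.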
To be more specific: after all the actions — whether automatic ones like WAL splitting and zk clearance, or manual ones like hbck -repair — as long as we're sure we don't need to care about those dead servers any more, we could remove them from the master UI. If this satisfies what you desire, I could open a JIRA and get the work done (smile).

Let me know your thoughts, thanks.

Best Regards,
Yu

On 28 May 2017 at 23:26, jeff saremi wrote:

I think more and more deployments are being made dynamic using Yarn and Mesos. Going back to a fixed set of servers is not going to eliminate the problem I'm talking about. Assuming that the region servers come back on the same node is too optimistic.

Let me try this a different way to see if I can make my point:

- A cluster is either healthy or not healthy.
- If the cluster is unhealthy, then it can be made healthy using either external tools (hbck) or the internal agreement of master and regionservers. If this is not achievable, then the cluster must be discarded.
- The cluster is now healthy, meaning that no stale information — dead servers, dead regions, or anything else — should linger anywhere in the system, and no such information should ever be brought to the attention of the cluster's administrators.
- If such information is still hiding somewhere in the system, it only means that the mechanism (hbck or hbase itself) that made the system healthy did not finish cleaning up what needed to be cleaned up.

________________________________
From: Ted Yu
Sent: Saturday, May 27, 2017 1:54:50 PM
To: dev@hbase.apache.org
Cc: Hbase-User; Yu Li
Subject: Re: What is Dead Region Servers and how to clear them up?

The involvement of Yarn can explain why you observed relatively more dead servers (compared to a traditional deployment).

Suppose in the first run, Yarn allocates containers for region servers on a set of nodes.
Subsequently, Yarn may choose nodes (for the same number of servers) which are not exactly the same nodes as in the previous run.

What Yu Li described as restarting a server means restarting it on the same node where it was running previously.

Cheers

On Sat, May 27, 2017 at 11:59 AM, jeff saremi wrote:

> Yes. We don't have fixed servers, with the exception of the ZK machines.
>
> We have 3 yarn jobs, one for each of the master, region, and thrift servers,
> each launched separately with a different number of nodes. I hope that's not
> what is causing problems.
>
> ________________________________
> From: Ted Yu
> Sent: Saturday, May 27, 2017 11:27:36 AM
> To: dev@hbase.apache.org
> Cc: Hbase-User; Yu Li
> Subject: Re: What is Dead Region Servers and how to clear them up?
>
> Jeff:
> bq. We run our cluster on Yarn and upon restarting jobs in Yarn
>
> Can you clarify a bit more - are you running hbase processes inside Yarn
> containers?
>
> Cheers
>
> On Sat, May 27, 2017 at 10:58 AM, jeff saremi wrote:
>
> > Thanks @Yu Li
> >
> > You are absolutely correct. Dead RS's will happen regardless. My issue
> > with this is more "psychological". If I have done everything needed to be
> > done to ensure that the RSs are running fine, regions are assigned, and
> > hbck reports are consistent, then how is this list of dead region
> > servers helping me, other than causing anxiety?
> > We run our cluster on Yarn, and upon restarting jobs in Yarn we get a lot
> > of inconsistent, unavailable regions (and this is only one scenario). Then
> > we'll run hbck with the -repair option (and I was wrong here too: hbck does
> > take care of some issues) and restart the master(s). After that there seem
> > to be no more issues other than dead region servers still being reported.
> > We should not see these anymore after having taken all precautions to reset
> > the system properly.
> > I was trying to write something similar to what hbck would do to take
> > care of this specific issue. I wouldn't mind contributing to hbck
> > itself either. However, I needed to understand where this list comes from
> > and why. These are things that I could possibly automate (after all the
> > other steps I mentioned):
> > - Check the ZK list of RS's. If any of the dead RS's are found, remove the node.
> >
> > - Check the hdfs root WALs folder. If there are any entries with a dead RS's name
> > in them, delete them. (Here we need to take precautions, as @Enis mentioned;
> > possibly only if the node's timestamp has not changed in a while.)
> >
> > - What else? These steps are not enough.
> >
> > For instance, we currently have 17 servers being reported as dead. Only
> > 3-4 of them show up in hdfs with "-splitting" in their WALs folder. Where
> > do the rest come from?
> > thanks
> >
> > Jeff
> >
> > ________________________________
> > From: Yu Li
> > Sent: Friday, May 26, 2017 10:18:09 PM
> > To: Hbase-User
> > Cc: dev@hbase.apache.org
> > Subject: Re: What is Dead Region Servers and how to clear them up?
> >
> > bq. And having a list of "dead" servers is not a healthy thing to have.
> > I don't think the existence of "dead" servers means the service is
> > unhealthy, especially in a distributed system. Besides hbase, HDFS also
> > shows Live and Dead nodes in the namenode UI, and people won't regard HDFS as
> > unhealthy if there are dead nodes.
> >
> > In HBase, if some RS aborts due to an unexpected issue like a long GC, normally
> > we will restart it, and once it's restarted and reports to the master, it will be
> > removed from the dead server list. So when we observe a dead server in the
> > Master UI, the first thing is to check the root cause and restart it if it
> > won't cause further issues.
> >
> > However, sometimes we may find the server aborted due to some hardware
> > failure and we must take the server offline for repair.
> > Or we need to move
> > some nodes to join other clusters, so we stop the RS process on purpose. I
> > guess this is the case you're dealing with, @jeff? If so, I think it's a
> > reasonable requirement that we supply a command in hbase to clear the dead
> > nodes once the operator is sure they no longer serve.
> >
> > Best Regards,
> > Yu
> >
> > On 27 May 2017 at 04:49, Enis Söztutar wrote:
> >
> > > In general, if there are no regions in transition, the WAL recovery has
> > > already finished. You can watch the master's log4j log for those entries,
> > > but the lack of regions in transition is the easiest way to identify it.
> > >
> > > Enis
> > >
> > > On Fri, May 26, 2017 at 12:14 PM, jeff saremi wrote:
> > >
> > > > thanks Enis
> > > >
> > > > I apologize for earlier
> > > >
> > > > This looks very close to our issue.
> > > > When you say "there is no 'WAL' recovery happening", how could I make
> > > > sure of that? Thanks
> > > >
> > > > Jeff
> > > >
> > > > ________________________________
> > > > From: Enis Söztutar
> > > > Sent: Friday, May 26, 2017 11:47:11 AM
> > > > To: dev@hbase.apache.org
> > > > Cc: hbase-user
> > > > Subject: Re: What is Dead Region Servers and how to clear them up?
> > > >
> > > > Jeff, please be respectful to the people who are trying to help you. This is
> > > > not acceptable behavior and will result in consequences next time.
> > > >
> > > > On the specific issue that you are seeing, it is highly likely that you are
> > > > seeing this: https://issues.apache.org/jira/browse/HBASE-14223. Having
> > > > those servers in the dead servers list will not hurt operations, or
> > > > runtimes, or anything else. Possibly for those servers, there is no new
> > > > instance of the regionserver running on the same host and port.
> > > > If you want to manually clean out these, you can follow these steps:
> > > > - Manually move these directories from the file system:
> > > > /WALs/dead-server-splitting
> > > > - ONLY do this if you are sure that there is no "WAL" recovery
> > > > happening, and there are only WAL files with names containing ".meta."
> > > > - Restart the HBase master.
> > > >
> > > > Upon restart, you can see that these do not show up anymore. For more
> > > > technical details, please refer to the jira link.
> > > >
> > > > Enis
> > > >
> > > > On Fri, May 26, 2017 at 11:03 AM, jeff saremi <jeffsaremi@hotmail.com>
> > > > wrote:
> > > >
> > > > > Thank you for the GFY answer
> > > > >
> > > > > And I guess to figure out how to fix these I can always go through the
> > > > > HBase source code.
> > > > >
> > > > > ________________________________
> > > > > From: Dima Spivak
> > > > > Sent: Friday, May 26, 2017 9:58:00 AM
> > > > > To: hbase-user
> > > > > Subject: Re: What is Dead Region Servers and how to clear them up?
> > > > >
> > > > > Sending this back to the user mailing list.
> > > > >
> > > > > RegionServers can die for many reasons. Looking at your RegionServer log
> > > > > files should give hints as to why it's happening.
> > > > >
> > > > > -Dima
> > > > >
> > > > > On Fri, May 26, 2017 at 9:48 AM, jeff saremi <jeffsaremi@hotmail.com>
> > > > > wrote:
> > > > >
> > > > > > I had posted this to the user mailing list and I have not got any
> > > > > > direct answer to my question.
> > > > > >
> > > > > > Where do dead RS's come from and how can they be cleaned up? Someone in
> > > > > > the midst of developers should know this.
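As an aside, Enis's manual-cleanup criteria quoted above — only touch "-splitting" directories whose contents are nothing but ".meta." WAL files, and only while no WAL recovery (no regions in transition) is pending — can be sketched as a filter over a directory listing. This is a hedged illustration: the directory and file names are invented, and on a real cluster the listing would come from the HBase root directory on HDFS, with a master restart afterwards.

```python
# Illustrative filter for the manual cleanup described in this thread:
# from a listing of the WALs directory, pick the "-splitting" dirs that
# hold nothing but .meta. WAL files. Names below are made up.
from typing import Dict, List

def safe_to_clean(wal_dirs: Dict[str, List[str]]) -> List[str]:
    """wal_dirs maps a dir name under /WALs to the file names inside it."""
    return [d for d, files in wal_dirs.items()
            if d.endswith("-splitting")
            and all(".meta." in f for f in files)]

listing = {
    "node-a,16020,1493846660401-splitting":
        ["node-a%2C16020%2C1493846660401..meta.1493922323600.meta"],
    "node-b,16020,1493846660402-splitting":
        ["node-b%2C16020%2C1493846660402.default.1493922323601"],  # real WAL: keep
    "node-c,16020,1493846660403": [],  # live server dir: keep
}
assert safe_to_clean(listing) == ["node-a,16020,1493846660401-splitting"]
```

The check deliberately errs on the side of keeping a directory: anything containing a non-meta WAL, or not marked "-splitting", is left alone for WAL splitting or hbck to handle.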
> > > > > >
> > > > > > thanks
> > > > > >
> > > > > > Jeff
> > > > > >
> > > > > > ________________________________
> > > > > > From: jeff saremi
> > > > > > Sent: Thursday, May 25, 2017 10:23:17 AM
> > > > > > To: user@hbase.apache.org
> > > > > > Subject: Re: What is Dead Region Servers and how to clear them up?
> > > > > >
> > > > > > I'm still looking to get hints on how to remove the dead regions. thanks
> > > > > >
> > > > > > ________________________________
> > > > > > From: jeff saremi
> > > > > > Sent: Wednesday, May 24, 2017 12:27:06 PM
> > > > > > To: user@hbase.apache.org
> > > > > > Subject: Re: What is Dead Region Servers and how to clear them up?
> > > > > >
> > > > > > I'm trying to eliminate the dead region servers.
> > > > > >
> > > > > > ________________________________
> > > > > > From: Ted Yu
> > > > > > Sent: Wednesday, May 24, 2017 12:17:40 PM
> > > > > > To: user@hbase.apache.org
> > > > > > Subject: Re: What is Dead Region Servers and how to clear them up?
> > > > > >
> > > > > > bq. running hbck (many times)
> > > > > >
> > > > > > Can you describe the specific inconsistencies you were trying to resolve?
> > > > > > Depending on the inconsistencies, advice can be given on the best known
> > > > > > hbck command arguments to use.
> > > > > >
> > > > > > Feel free to pastebin the master log if needed.
> > > > > >
> > > > > > On Wed, May 24, 2017 at 12:10 PM, jeff saremi <jeffsaremi@hotmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > These are the things I have done so far:
> > > > > > >
> > > > > > > - restarting the master (a few times)
> > > > > > >
> > > > > > > - running hbck (many times; this tool does not seem to be doing anything
> > > > > > > at all)
> > > > > > >
> > > > > > > - checking the list of region servers in ZK (none of the dead ones are
> > > > > > > listed here)
> > > > > > >
> > > > > > > - checking the WALs under /WALs.
> > > > > > > Out of the 11 dead ones, only 3
> > > > > > > are listed here with "-splitting" at the end of their names, and they
> > > > > > > contain a single file like: 1493846660401..meta.1493922323600.meta
> > > > > > >
> > > > > > > ________________________________
> > > > > > > From: jeff saremi
> > > > > > > Sent: Wednesday, May 24, 2017 9:04:11 AM
> > > > > > > To: user@hbase.apache.org
> > > > > > > Subject: What is Dead Region Servers and how to clear them up?
> > > > > > >
> > > > > > > Apparently having dead region servers is so common that a section of the
> > > > > > > master console is dedicated to it?
> > > > > > > How can we clean this up (preferably in an automated fashion)? Why isn't
> > > > > > > this being done by HBase automatically?
> > > > > > >
> > > > > > > thanks