From: Dejan Menges
Date: Mon, 13 Apr 2015 18:08:38 +0000
Subject: Re: How server gets into failed servers list?
To: "user@hbase.apache.org"

Hi Esteban,

Thanks for pointing that out. I will try to collect all the logs tomorrow, take a deeper look, and post the specific errors here. The good news is that all logs are preserved.

Thanks a lot,
Dejan

On Mon, Apr 13, 2015 at 8:01 PM Esteban Gutierrez wrote:

> Hi Dejan,
>
> Do you have the logs from any of those failed region servers? Usually, in
> case of a critical failure, the RS will shut itself down; or, if the RS
> "hangs" for a long time, the master will start processing the expiration
> of that RS and reject the RS with a YouAreDeadException if it tries to
> reconnect. The HBase master and RS logs will tell us for sure.
>
> thanks,
> esteban.
>
> --
> Cloudera, Inc.
>
> On Mon, Apr 13, 2015 at 1:11 AM, Dejan Menges wrote:
>
> > Hi,
> >
> > We had some issues recently with HDFS - hardware issue with one of the
> > nodes, nodes died, HDFS recovered, but we figured out that something is
> > wrong with HBase. Checking the HMaster log, we saw that a bunch of our
> > region servers ended up in the famous failed servers list, and it was
> > going on and on until we restarted every one of them.
> >
> > Are we doing something wrong? Is it possible somehow to tune this, so
> > that once a server is in this list it gets forgotten about or something?
> >
> > Main question - how does HMaster decide that a server should be in the
> > failed servers list, and what does this mean exactly?
> >
> > I was looking into the HBase book and googling, but besides some generic
> > answers I wasn't able to find anything more internal.
> >
> > Thanks in advance!