Date: Fri, 28 May 2010 14:08:04 -0300
Subject: Re: Zookeeper apparently going down
From: Lucas Nazário dos Santos
To: user@hbase.apache.org

I restarted the process and the logs are gone. I'll keep monitoring HBase
and if this error happens again I'll post the logs here.

Thanks a lot.
Lucas

On Fri, May 28, 2010 at 1:50 PM, Jean-Daniel Cryans wrote:

> Yeah this is very suspicious. Also, the fact that the error the master
> tripped over happened just after the region server stopped logging in
> that file seems even more suspicious. Usually when there's an error in
> the regionserver's main thread it will go to sysout, so that's the .out
> file instead of the .log file, but every time you restart a process it
> overwrites it, so unless you didn't restart the region server we
> probably lost the info that was in there. And if the process did die,
> then it really explains why the master wasn't able to connect to it.
>
> J-D
>
> On Fri, May 28, 2010 at 8:37 AM, Lucas Nazário dos Santos wrote:
> > Here are the complete logs:
> >
> > http://www.ninvest.com.br/docs/logs_hbase/hbase-root-master-ip-10-251-158-224.log
> > http://www.ninvest.com.br/docs/logs_hbase/hbase-root-zookeeper-ip-10-251-158-224.log
> > http://www.ninvest.com.br/docs/logs_hbase/hbase-root-regionserver-ip-10-251-158-224.log
> >
> > The regionserver stopped logging at 8:31am. Strange...
> >
> > I hope this helps.
> >
> > Lucas
> >
> > On Thu, May 27, 2010 at 8:09 PM, Jean-Daniel Cryans wrote:
> >
> >> On Thu, May 27, 2010 at 4:01 PM, Lucas Nazário dos Santos wrote:
> >> > Thanks a lot for the responses. I'll be monitoring HBase and get
> >> > back in touch if it happens again.
> >> >
> >> > Maybe HBase could employ a mechanism to automatically recover from
> >> > connectivity issues like the one I went through. Then others and I
> >> > wouldn't need to manually restart it.
> >>
> >> Well usually if one machine is not reachable, it's not a big deal
> >> since there are other machines to connect to and HBase redistributes
> >> the regions to them. Also, why is it refused? Can we see the region
> >> server log?
> >>
> >> > I still didn't get why the master kept failing even after its
> >> > recovery, and why I had to stop/start the cluster in order to get
> >> > rid of the "Connection refused" error.
> >>
> >> I'd also like to understand why the region server isn't responding,
> >> the master can only know so much.
> >>
> >> > I'm assuming it's not a big deal and my solution can live with it.
> >> >
> >> > More logs below.
> >>
> >> Consider pastebin or a web server next time ;)