Subject: Re: RegionServers Crashing every hour in production env
From: Stack
To: Hbase-User
Reply-To: user@hbase.apache.org
Date: Fri, 8 Mar 2013 14:02:16 -0800
In-Reply-To: <513A34D0.80507@psafe.com>
References: <513A0758.3090507@psafe.com> <513A34D0.80507@psafe.com>
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary=f46d041c46aaf857cd04d770f98b

--f46d041c46aaf857cd04d770f98b
Content-Type: text/plain; charset=UTF-8

On Fri, Mar 8, 2013 at 10:58 AM, Pablo Musa wrote:

>> 0.94 currently doesn't support hadoop 2.0
>> Can you deploy hadoop 1.1.1 instead ?
>
> I am using cdh4.2.0, which uses this version as its default installation.
> I think it will be a problem for me to deploy 1.1.1 because I would need to
> "upgrade" the whole cluster with 70TB of data (back up everything, go
> offline, etc.).
>
> Is there a problem with using cdh4.2.0?
> Should I send my email to the cdh list?
>

That combo should be fine.

>> You Full GC'ing around this time?
>
> The GC log shows it took a long time. However, it does not make sense for
> GC to be the cause, since the same amount of data was cleaned before and
> AFTER in just 0.01 secs!
>

If the JVM is full GC'ing, the application is stopped.

> [Times: user=0.08 sys=137.62, real=137.62 secs]
>
> Besides, the whole time was used by system. That is what is bugging me.
>

The below does not look like a full GC, but that is a long pause in system
time, enough to kill your zk session.

You swapping?

Hardware is good?

St.Ack

> ...
>
> 1044.081: [GC 1044.081: [ParNew: 58970K->402K(59008K), 0.0040990 secs] 275097K->216577K(1152704K), 0.0041820 secs] [Times: user=0.03 sys=0.00, real=0.01 secs]
>
> 1087.319: [GC 1087.319: [ParNew: 52873K->6528K(59008K), 0.0055000 secs] 269048K->223592K(1152704K), 0.0055930 secs] [Times: user=0.04 sys=0.01, real=0.00 secs]
>
> 1087.834: [GC 1087.834: [ParNew: 59008K->6527K(59008K), 137.6353620 secs] 276072K->235097K(1152704K), 137.6354700 secs] [Times: user=0.08 sys=137.62, real=137.62 secs]
>
> 1226.638: [GC 1226.638: [ParNew: 59007K->1897K(59008K), 0.0079960 secs] 287577K->230937K(1152704K), 0.0080770 secs] [Times: user=0.05 sys=0.00, real=0.01 secs]
>
> 1227.251: [GC 1227.251: [ParNew: 54377K->2379K(59008K), 0.0095650 secs] 283417K->231420K(1152704K), 0.0096340 secs] [Times: user=0.06 sys=0.00, real=0.01 secs]
>
>
> I really appreciate you guys helping me to find out what is wrong.
>
> Thanks,
> Pablo
>
>
> On 03/08/2013 02:11 PM, Stack wrote:
>
>> What RAM says.
>>
>> 2013-03-07 17:24:57,887 INFO org.apache.zookeeper.ClientCnxn: Client
>> session timed out, have not heard from server in 159348ms for sessionid
>> 0x13d3c4bcba600a7, closing socket connection and attempting reconnect
>>
>> You Full GC'ing around this time?
>>
>> Put up your configs in a place where we can take a look?
>>
>> St.Ack
>>
>>
>> On Fri, Mar 8, 2013 at 8:32 AM, ramkrishna vasudevan
>> <ramkrishna.s.vasudevan@gmail.com> wrote:
>>
>>> I think it is with your GC config. What is your heap size? What is the
>>> data that you pump in and how much is the block cache size?
>>>
>>> Regards
>>> Ram
>>>
>>> On Fri, Mar 8, 2013 at 9:31 PM, Ted Yu wrote:
>>>
>>>> 0.94 currently doesn't support hadoop 2.0
>>>>
>>>> Can you deploy hadoop 1.1.1 instead ?
>>>>
>>>> Are you using 0.94.5 ?
>>>>
>>>> Thanks
>>>>
>>>> On Fri, Mar 8, 2013 at 7:44 AM, Pablo Musa wrote:
>>>>
>>>>> Hey guys,
>>>>> as I wrote in an email a long time ago, the RSs in my cluster did not
>>>>> get along and crashed 3 times a day. I tried a lot of the options we
>>>>> discussed in those emails, but none of them solved the problem. As I
>>>>> was using an old version of hadoop, I thought that was the problem.
>>>>>
>>>>> So, I upgraded from hadoop 0.20 - hbase 0.90 - zookeeper 3.3.5 to
>>>>> hadoop 2.0.0 - hbase 0.94 - zookeeper 3.4.5.
>>>>>
>>>>> Unfortunately the RSs did not stop crashing, and worse! Now they crash
>>>>> every hour, and sometimes when the RS that holds the .ROOT. crashes,
>>>>> the whole cluster gets stuck in transition and everything stops working.
>>>>> In this case I need to clean the zookeeper znodes and restart the
>>>>> master and the RSs.
>>>>> To avoid this case I am running in production with only ONE RS and a
>>>>> monitoring script that checks every minute whether the RS is ok. If
>>>>> not, it restarts it.
>>>>> * This case does not get the cluster stuck.
>>>>>
>>>>> This is driving me crazy, but I really can't find a solution for the
>>>>> cluster.
>>>>> I tracked all logs from the start time 16:49 from all interesting nodes
>>>>> (zoo, namenode, master, rs, dn2, dn9, dn10) and copied here what I
>>>>> think is useful.
>>>>>
>>>>> There are some strange errors in DATANODE2, such as an error copying a
>>>>> block to itself.
>>>>>
>>>>> The gc log points to a GC timeout. However, it is very weird that the
>>>>> RS spends so much time in GC while in the other cases it takes
>>>>> 0.001sec. Besides, the time spent is in sys, which makes me think that
>>>>> the problem might be in another place.
>>>>>
>>>>> I know that it is a bunch of logs, and that it is very difficult to
>>>>> find the problem without much context. But I REALLY need some help. If
>>>>> not the solution, at least what I should read, where I should look, or
>>>>> which cases I should monitor.
>>>>>
>>>>> Thank you very much,
>>>>> Pablo Musa
>>>>>
>>>>>
>
--f46d041c46aaf857cd04d770f98b--
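
For anyone reading this thread later: the point made above is that the
137-second ParNew event is odd because almost all of its wall-clock time is
sys time rather than user time. A minimal sketch of how one might scan a GC
log for that pattern follows. It assumes log lines shaped like the ones
quoted in this thread (one event per line, ending in a "[Times: ...]"
trailer). The names suspicious_pauses, Pause, and min_real_secs are
hypothetical and chosen only for this illustration; the 1-second threshold
is an arbitrary assumption, not a recommended value.

```python
import re
from typing import NamedTuple

# Matches one GC event line of the shape quoted in the thread, capturing the
# JVM-relative start timestamp and the user/sys/real values from the trailer.
GC_EVENT = re.compile(
    r"^(?P<start>\d+\.\d+): \[GC .*"
    r"\[Times: user=(?P<user>\d+\.\d+) sys=(?P<sys>\d+\.\d+), "
    r"real=(?P<real>\d+\.\d+) secs\]"
)


class Pause(NamedTuple):
    start: float  # seconds since JVM start, as printed in the log
    user: float   # CPU seconds spent in user mode
    sys: float    # CPU seconds spent in the kernel
    real: float   # wall-clock seconds


def suspicious_pauses(lines, min_real_secs=1.0):
    """Return events that are long AND dominated by sys time.

    min_real_secs is an assumed threshold for this sketch only.
    """
    found = []
    for line in lines:
        m = GC_EVENT.search(line.strip())
        if not m:
            continue  # not a GC event line in the expected shape
        p = Pause(*(float(m.group(k)) for k in ("start", "user", "sys", "real")))
        if p.real >= min_real_secs and p.sys > p.user:
            found.append(p)
    return found


# The five events quoted in this thread, one per line.
SAMPLE = [
    "1044.081: [GC 1044.081: [ParNew: 58970K->402K(59008K), 0.0040990 secs] "
    "275097K->216577K(1152704K), 0.0041820 secs] "
    "[Times: user=0.03 sys=0.00, real=0.01 secs]",
    "1087.319: [GC 1087.319: [ParNew: 52873K->6528K(59008K), 0.0055000 secs] "
    "269048K->223592K(1152704K), 0.0055930 secs] "
    "[Times: user=0.04 sys=0.01, real=0.00 secs]",
    "1087.834: [GC 1087.834: [ParNew: 59008K->6527K(59008K), 137.6353620 secs] "
    "276072K->235097K(1152704K), 137.6354700 secs] "
    "[Times: user=0.08 sys=137.62, real=137.62 secs]",
    "1226.638: [GC 1226.638: [ParNew: 59007K->1897K(59008K), 0.0079960 secs] "
    "287577K->230937K(1152704K), 0.0080770 secs] "
    "[Times: user=0.05 sys=0.00, real=0.01 secs]",
    "1227.251: [GC 1227.251: [ParNew: 54377K->2379K(59008K), 0.0095650 secs] "
    "283417K->231420K(1152704K), 0.0096340 secs] "
    "[Times: user=0.06 sys=0.00, real=0.01 secs]",
]

if __name__ == "__main__":
    for p in suspicious_pauses(SAMPLE):
        print(f"t={p.start}s real={p.real}s sys={p.sys}s user={p.user}s")
```

On the sample above only the event at 1087.834 is flagged. A pause like
that, with sys far larger than user, usually points away from the collector
itself and toward the OS or hardware, which is why the questions about
swapping and hardware come up in the reply.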