Subject: Re: HBase 0.90.0 region servers dying
From: Jean-Daniel Cryans
To: user@hbase.apache.org, Enis Soztutar
Date: Tue, 22 Feb 2011 15:34:25 -0500

Ted asked about the JVM version but I don't think you answered that.
In any case, try with u17.

J-D

On Sat, Feb 19, 2011 at 3:58 AM, Enis Soztutar wrote:
> Yes indeed but no luck.
>
> Enis
>
> On Fri, Feb 18, 2011 at 11:50 AM, Jean-Daniel Cryans wrote:
>>
>> Just to make sure, you did check in the .out file after a failure right?
>>
>> J-D
>>
>> On Thu, Feb 17, 2011 at 10:14 PM, Enis Soztutar wrote:
>> > Hi,
>> >
>> > Thanks everyone for the answers.
>> > I had already increased the file descriptors to 32768. The region servers
>> > and the zookeeper processes are dying, but datanode and tasktrackers keep
>> > running (they are configured with a max heap of 1 GB). The logs do not
>> > contain any indication that something is going wrong. The last entries in
>> > the logs are typical INFO level logs. I have also checked the kernel logs,
>> > but the kernel does not report that it is killing the processes either.
>> > While testing, two of the servers restarted at different times, which was
>> > the original reason that I had suspected a memory error.
>> > But after we replaced the power supplies, nodes did not restart, but the
>> > processes kept dying.
>> >
>> > For the load, the ycsb test for 10M records goes on for a while at 4K
>> > inserts per sec, but cannot complete due to region servers dying one by
>> > one. iostat also shows light cpu and io utilization around 20%. Any more
>> > suggestions for debugging would be more than welcome.
>> >
>> > Thanks,
>> > Enis
>> >
>> > On Wed, Feb 16, 2011 at 5:13 AM, Eric wrote:
>> >
>> >> Did you increase the max open files on your system (in
>> >> /etc/security/limits.conf)?
>> >>
>> >> 2011/2/16 Enis Soztutar
>> >>
>> >> > Hi,
>> >> >
>> >> > We have newly set up a cluster of 5 nodes, each with 16 GB of RAM. We
>> >> > use HBase 0.90.0 on top of Hadoop from CDH3. When testing HBase under
>> >> > heavy load generated by YCSB, we consistently see region servers dying
>> >> > silently, without any logs or exceptions (not even in system logs). We
>> >> > couldn't track down the problem, so we have tested the same setup on a
>> >> > Rackspace cluster with 7 nodes but similar hardware, and we didn't have
>> >> > any problem.
>> >> >
>> >> > We are suspecting a problem with the RAM or the motherboards, but all
>> >> > memory tests run successfully. I was wondering if anyone had similar
>> >> > problems before and is there anything you suggest to nail down the
>> >> > issue.
>> >> >
>> >> > Thanks,
>> >> > Enis
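
A minimal sketch, not taken from the thread, for confirming that the raised open-file limit discussed above is the one actually in effect. It assumes a Linux host with Python available and uses the 32768 target mentioned in the thread; the helper names nofile_limits and limit_at_least are hypothetical. Limits from /etc/security/limits.conf are applied per user at login, so it only tells you something if run as the same user that starts the HBase processes.

```python
import resource


def nofile_limits():
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def limit_at_least(limit, wanted):
    """True if the limit is unlimited or is at least the wanted value."""
    return limit == resource.RLIM_INFINITY or limit >= wanted


if __name__ == "__main__":
    wanted = 32768  # value mentioned in the thread
    soft, hard = nofile_limits()
    print("soft limit:", soft, "hard limit:", hard)
    if limit_at_least(soft, wanted):
        print("soft limit meets the target of", wanted)
    else:
        print("soft limit is below", wanted, "for this user")
```

This reports the limits of the shell it is run from, not of an already running region server. For a live process, the limits it was started with can be read from /proc/<pid>/limits.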