Date: Tue, 22 Feb 2011 15:34:25 -0500
Message-ID: <AANLkTi=LD=waTQEtuybVDix1_nYctOSCzqjF49kcbnNx@mail.gmail.com>
Subject: Re: HBase 0.90.0 region servers dying
From: Jean-Daniel Cryans <jdcryans@apache.org>
To: user@hbase.apache.org, Enis Soztutar <enis.soz.nutch@gmail.com>

Ted asked about the JVM version but I don't think you answered that.
In any case, try with u17.

J-D

On Sat, Feb 19, 2011 at 3:58 AM, Enis Soztutar <enis.soz.nutch@gmail.com> wrote:
> Yes indeed but no luck.
>
> Enis
>
> On Fri, Feb 18, 2011 at 11:50 AM, Jean-Daniel Cryans <jdcryans@apache.org> wrote:
>>
>> Just to make sure, you did check in the .out file after a failure right?
>>
>> J-D
>>
>> On Thu, Feb 17, 2011 at 10:14 PM, Enis Soztutar
>> <enis.soz.nutch@gmail.com> wrote:
>> > Hi,
>> >
>> > Thanks everyone for the answers.
>> > I had already increased the file descriptor limit to 32768. The region
>> > servers and the ZooKeeper processes are dying, but the datanodes and
>> > tasktrackers keep running (they are configured with a max heap of 1 GB).
>> > The logs do not contain any indication that something is going wrong; the
>> > last entries are typical INFO-level messages. I have also checked the
>> > kernel logs, but the kernel does not report that it is killing the
>> > processes either. While testing, two of the servers restarted at different
>> > times, which was the original reason I had suspected a memory error. After
>> > we replaced the power supplies, the nodes stopped restarting, but the
>> > processes kept dying.
>> >
>> > As for the load, the YCSB test for 10M records runs for a while at 4K
>> > inserts per second, but cannot complete because the region servers die one
>> > by one. iostat also shows light CPU and I/O utilization, around 20%. Any
>> > more suggestions for debugging would be more than welcome.
>> >
>> > Thanks,
>> > Enis
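
The kernel-log check described above can be scripted so it is easy to repeat on every node after each failure. This is a minimal sketch, assuming Python is available on the nodes and the kernel log has been captured as text (for example from dmesg). The function name is hypothetical, and the patterns are phrases the Linux OOM killer commonly logs; exact wording varies by kernel version.

```python
import re

# Phrases the Linux OOM killer commonly writes to the kernel log.
# Assumption: wording differs between kernel versions, so treat this
# pattern as a starting point rather than an exhaustive list.
OOM_PATTERN = re.compile(r"out of memory|oom-killer|killed process", re.IGNORECASE)


def find_oom_lines(lines):
    """Return the log lines that look like OOM-killer activity."""
    return [line.rstrip("\n") for line in lines if OOM_PATTERN.search(line)]
```

An empty result only means the kernel log holds no OOM-killer trace; it does not rule out other causes of a silent process death.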
>> >
>> > On Wed, Feb 16, 2011 at 5:13 AM, Eric <eric.xkcd@gmail.com> wrote:
>> >
>> >> Did you increase the max open files on your system (in
>> >> /etc/security/limits.conf)?
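
One way to confirm that a limits.conf change is actually in effect is to read the limit from inside a process started by the same user. This is a minimal sketch using Python's standard `resource` module (Unix only). The helper names are hypothetical, and the default target of 32768 is simply the value mentioned in this thread.

```python
import resource


def open_file_limits():
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def meets_target(target=32768):
    """True if the soft open-file limit is at least `target`."""
    soft, _hard = open_file_limits()
    # RLIM_INFINITY means "no limit", which satisfies any target.
    return soft == resource.RLIM_INFINITY or soft >= target
```

Limits are inherited per process, so this reports the limit for the session that runs the script, not for a region server that is already running. On Linux, the limits of a running process can be read from /proc/<pid>/limits.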
>> >>
>> >
>> >> 2011/2/16 Enis Soztutar <enis.soz.nutch@gmail.com>
>> >>
>> >> > Hi,
>> >> >
>> >> > We have newly set up a cluster of 5 nodes, each with 16 GB of RAM. We
>> >> > use HBase 0.90.0 on top of Hadoop from CDH3. When testing HBase under
>> >> > heavy load generated by YCSB, we consistently see region servers dying
>> >> > silently, without any logs or exceptions (not even in the system logs).
>> >> > We couldn't track down the problem, so we tested the same setup on a
>> >> > Rackspace cluster with 7 nodes but similar hardware, and we didn't have
>> >> > any problems there.
>> >> >
>> >> > We suspect a problem with the RAM or the motherboards, but all memory
>> >> > tests run successfully. I was wondering whether anyone has had similar
>> >> > problems before, and whether there is anything you would suggest to
>> >> > nail down the issue.
>> >> >
>> >> > Thanks,
>> >> > Enis
>> >> >
>> >>
>> >
>
>