From: Dejan Menges
Date: Fri, 20 Mar 2015 12:36:03 +0000
Subject: Re: Strange issue when DataNode goes down
To: user@hbase.apache.org

Hi,

Sorry for the slightly late update, but I managed to narrow it down a bit.

We haven't upgraded yet, as we are using the Hortonworks distribution right
now, and even if we upgrade we would only get 0.98.4. However, it looks like
the issue was in our use case and configuration (still looking into it).

Initially I saw that whenever one server goes down we start having general
performance issues, but it turned out to be on our client side: because of
caching, clients kept trying to reconnect to nodes that were offline and then
tried to fetch regions from those nodes as well. That is basically why, on
the server side, I could not find anything in the logs that was even slightly
interesting or pointed me in the right direction.

Another question that came up: when a server goes down (and with it the
DataNode and the HRegionServer it was hosting), what is a sensible time for
the HMaster to consider the server dead and reassign its regions elsewhere?
This is another performance bottleneck we hit while those regions are
inaccessible. In our case it is configured to 15 minutes, and simple logic
says that if you want it to happen sooner you configure a lower number of
retries, but as always the devil is in the details, so I am not sure whether
anyone has better math for this.
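Just to make this concrete, below are the kinds of settings I have been
staring at on our side. The values are purely illustrative (not our
production config), and I may well be looking at the wrong knobs, so please
correct me if so:

    <!-- hbase-site.xml: illustrative values only -->

    <!-- As I understand it, the master only declares a RegionServer dead
         and starts reassigning its regions after its ZooKeeper session
         expires. -->
    <property>
      <name>zookeeper.session.timeout</name>
      <value>90000</value>
    </property>

    <!-- Client side: how many times, and with how long a pause, clients
         keep retrying a cached (possibly dead) region location. Fewer
         retries / a shorter pause means clients give up on a dead node
         sooner, at the price of more failures during short hiccups. -->
    <property>
      <name>hbase.client.retries.number</name>
      <value>10</value>
    </property>
    <property>
      <name>hbase.client.pause</name>
      <value>100</value>
    </property>

    <!-- How long a single RPC to a RegionServer may take before the client
         gives up on that attempt. -->
    <property>
      <name>hbase.rpc.timeout</name>
      <value>60000</value>
    </property>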
And one last question: is it possible to manually force HBase to reassign
regions? In this situation, while the HMaster is still retrying to contact
the dead node, it is impossible to force it using the 'balancer' command.
(A rough sketch of the shell commands I mean is at the bottom of this mail,
below the quoted thread.)

Thanks a lot!
Dejan

On Tue, Mar 17, 2015 at 9:37 AM Dejan Menges wrote:

> Hi,
>
> To be very honest, there is no particular reason why we stick to this
> version besides simply a lack of time to go through the upgrade process,
> but it looks like that is going to be the next step.
>
> Had a crazy day and didn't have time to go through all the logs again;
> on top of that, one of the machines (the last one where we had this
> issue) was fully reprovisioned yesterday, so I don't have logs from
> there anymore.
>
> Besides upgrading, which I will discuss today, can you just point me to
> the specific RPC issue in 0.98.0? The thing is that we see some strange
> RPC behaviour in this case, and I just want to check whether it is the
> same issue (we were already suspecting RPC).
>
> Thanks a lot!
> Dejan
>
> On Mon, Mar 16, 2015 at 9:32 PM, Andrew Purtell wrote:
>
>> Is there a particular reason why you are using HBase 0.98.0? The latest
>> 0.98 release is 0.98.11. There is a known performance issue with 0.98.0
>> pertaining to RPC that was fixed in later releases, so you should move
>> up from 0.98.0. In addition, hundreds of improvements and bug fixes have
>> gone into the ten releases since 0.98.0.
>>
>> On Mon, Mar 16, 2015 at 6:40 AM, Dejan Menges wrote:
>>
>> > Hi All,
>> >
>> > We have a strange issue with HBase performance (overall cluster
>> > performance) in case one of the DataNodes in the cluster unexpectedly
>> > goes down.
>> >
>> > The scenario is as follows:
>> > - The cluster works fine and is stable.
>> > - One DataNode unexpectedly goes down (PSU issue, network issue,
>> >   anything).
>> > - The whole HBase cluster goes down (performance becomes so bad that
>> >   we have to restart all RegionServers to bring it back to life).
>> >
>> > The funniest and most recent case was when we added a new node to the
>> > cluster (with 8 x 4T SATA disks) and left just the DataNode running on
>> > it, to give it a couple of days to accumulate some data. At some
>> > point, due to a hardware issue, the server rebooted (twice within
>> > three hours) at a moment when it had maybe 5% of the data it would
>> > eventually hold. Nothing besides the DataNode was running on it, and
>> > once it went down it affected literally everything; restarting the
>> > RegionServers in the end fixed it.
>> >
>> > We are using HBase 0.98.0 with Hadoop 2.4.0
>> >
>>
>> --
>> Best regards,
>>
>>    - Andy
>>
>> Problems worthy of attack prove their worth by hitting back. - Piet Hein
>> (via Tom White)
>>
>
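P.S. For the manual reassignment question above: this is the kind of thing I
had in mind from the HBase shell, assuming it works at all while the master
still thinks the old server might come back. Table, region and server names
below are made up just to show the shape of the commands:

    # Move a single region to a chosen RegionServer; 'move' takes the
    # *encoded* region name plus the target server as 'host,port,startcode'
    # (all values here are invented):
    move 'd0e89a1f5bb4c8e7a2f3b6c9d1e0f4a7', 'rs-node-07.example.com,60020,1426800000000'

    # Or unassign/assign a region so the master re-places it (full region
    # name, again invented):
    unassign 'usertable,user5000,1426800000000.d0e89a1f5bb4c8e7a2f3b6c9d1e0f4a7.', true
    assign 'usertable,user5000,1426800000000.d0e89a1f5bb4c8e7a2f3b6c9d1e0f4a7.'

    # 'balancer' is what we tried first; as far as I can tell it simply
    # refuses to run (returns false) while regions are in transition:
    balancer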