From: Nicolas Liochon <nkeywal@gmail.com>
Date: Mon, 23 Mar 2015 17:08:29 +0100
Subject: Re: Strange issue when DataNode goes down
To: user@hbase.apache.org

stale should not help for recoverLease: it helps for reads, but that's the step after lease recovery. It's not needed in recoverLease because recoverLease in HDFS just sorts the datanodes by heartbeat time, so usually the stale datanode will be the last one in the list.

On Mon, Mar 23, 2015 at 4:38 PM, Bryan Beaudreault wrote:

@Nicolas, I see, thanks. I'll keep the settings at default. So really, if everything else is configured properly, you should never reach the lease recovery timeout in any failure scenario. It seems that the staleness check would be the thing that prevents this, correct? I'm surprised it didn't help Dejan.

On Mon, Mar 23, 2015 at 11:20 AM, Nicolas Liochon wrote:

@bryan: yes, you can change hbase.lease.recovery.timeout if you changed the hdfs settings. But this setting is really for desperate cases: the recoverLease should have succeeded before.
As well, if you depend on hbase.lease.recovery.timeout, it means that you're wasting recovery time: the lease should be recovered in a few seconds.

On Mon, Mar 23, 2015 at 3:59 PM, Dejan Menges wrote:

Interesting discussion I also found, gives me some more light on what Nicolas is talking about - https://issues.apache.org/jira/browse/HDFS-3703

On Mon, Mar 23, 2015 at 3:53 PM, Bryan Beaudreault <bbeaudreault@hubspot.com> wrote:

So is it safe to set hbase.lease.recovery.timeout lower if you also set heartbeat.recheck.interval lower (lowering that 10.5 min dead node timer)? Or is it recommended not to touch either of those?

Reading the above with interest, thanks for digging in here guys.

On Mon, Mar 23, 2015 at 10:13 AM, Nicolas Liochon wrote:

If the node is actually down it's fine. But the node may not be that down (CAP theorem here), and then it's looking for trouble. HDFS, by default, declares a node as dead after 10:30. The 15 minutes is an extra safety margin. It seems your hdfs settings are different (or there is a bug...). There should be some info in the hdfs logs.

On Mon, Mar 23, 2015 at 3:05 PM, Dejan Menges <dejan.menges@gmail.com> wrote:

Will take a look.

Actually, if a node is down (someone unplugged the network cable, it just died, whatever), what's the chance it's going to come back to life so the write can continue? On the other side, HBase is not starting recovery; it keeps trying to contact a node which is not there anymore and is even declared dead on the Namenode side (another thing I didn't quite understand).

So what I was expecting is that as soon as the Namenode decides a node is dead, that would be enough for the RegionServer to stop trying to recover from the dead node, but that wasn't the case. Also, this whole MTTR article in the HBase book doesn't work at all with this parameter set to its default value (15 minutes).

So I'm still struggling to figure out what exactly the drawback of lowering it is.

On Mon, Mar 23, 2015 at 2:50 PM, Nicolas Liochon <nkeywal@gmail.com> wrote:

Thanks for the explanation. There is an issue if you modify this setting, however. hbase tries to recover the lease (i.e. be sure that nobody is writing). If you change hbase.lease.recovery.timeout, hbase will start the recovery (i.e. start to read) even if it's not sure that nobody's writing. That means there is a dataloss risk. Basically, you must not see this warning: WARN org.apache.hadoop.hbase.util.FSHDFSUtils: Cannot recoverLease after trying for[]

The recoverLease must succeed. The fact that it does not after X tries is strange. There may be a mismatch between the hbase parameters and the hdfs ones. You may need to have a look at the comments in FSHDFSUtils.java
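For anyone skimming the archive, the two HBase-side knobs being discussed live in hbase-site.xml. A minimal sketch follows; the values are illustrative, and the 64000 ms figure for hbase.lease.recovery.dfs.timeout is recalled from hbase-default.xml rather than taken from this thread, so double-check it against your version.

  <property>
    <name>hbase.lease.recovery.timeout</name>
    <!-- Total time HBase keeps calling recoverLease before it gives up and reads the WAL
         anyway; the default is 900000 ms (15 minutes). Lowering it speeds up failover,
         but reading before the lease is recovered is the data-loss risk described above. -->
    <value>900000</value>
  </property>
  <property>
    <name>hbase.lease.recovery.dfs.timeout</name>
    <!-- Pause between successive recoverLease calls; should stay larger than the HDFS-side
         timeouts so each attempt has a chance to complete (assumed default: 64000 ms). -->
    <value>64000</value>
  </property>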
On Mon, Mar 23, 2015 at 2:15 PM, Dejan Menges <dejan.menges@gmail.com> wrote:

I found the issue and fixed it, and will try to explain here what exactly happened in our case, in case someone else finds this interesting too.

So initially, we had (a couple of times) some network and hardware issues in our datacenters. When one server would die (literally die - we had some issues with PSUs) we saw issues with overall cluster performance on the HBase side. As the cluster is quite big and live, it was also quite hard to figure out the exact root cause and how to fix the things we wanted to fix.

So I set up another cluster: four nodes (with DataNode and RegionServer) and two other nodes with HMaster and Namenode in HA, using the same stack we use in production. We pumped some data into it, and I was able to reproduce the same issue on it last week. What I tried to do is cut off one server (shut down its interface) when all is good with the cluster, while we have load, and see what happens.

On Friday, after what Nicolas mentioned, I started taking a look at the HBase logs on the node which was mentioned in the HMaster log as the one taking over regions for the dead server. Basically what I was able to observe was a 15 minute gap (+/- a couple of seconds, which was also interesting - I will get to that later) between the moment HMaster figures out that one of its RegionServers is dead and the time one of the mentioned nodes starts taking over those regions and they start appearing in HMaster's Web UI.

I then set up everything as mentioned here - http://hbase.apache.org/book.html#mttr - but still had exactly the same issues. Went over (again and again) all the currently configured stuff, but still had the same issue.

Then I started looking into HDFS. Opened the NameNode UI, saw all was good, took one node down, and was watching the RegionServer logs at the same time - and I saw that it also took ~15 minutes for the Namenode to declare the dead node dead. Somehow at that same moment regions started coming back to life. I remembered the dfs timeout checks and check retries from some older versions.
Looked into the defaults for our Hadoop version - http://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml - and saw that there's no recheck parameter there anymore. Strange, as on StackOverflow I found a post from a month ago for a newer version than we use (we use 2.4.1, the guy was using 2.6) about dfs.namenode.heartbeat.recheck-interval. I set it to 10 seconds as he mentioned, with heartbeat checks every three seconds (the default), and got the DataNode declared dead in ~45 seconds, as he described. Not sure why this parameter is not documented, but obviously it works.

Then I figured out it still didn't fix our HBase failover issue. I was looking into the HBase book again and again, and then saw this part:

"How much time we allow elapse between calls to recover lease. Should be larger than the dfs timeout."

This is the description for hbase.lease.recovery.dfs.timeout. From that comment alone it wasn't clear which of all the timeouts that can be set in Hadoop/HBase, and that have something to do with DFS, this is actually about. But I picked hbase.lease.recovery.timeout, and that was the catch.

Initially, by default, hbase.lease.recovery.timeout is set to 15 minutes. Not sure why, and I wasn't able to find out why yet, but by getting this down to one minute (which is more than OK for us) I was able to get rid of our issue. I'm also not sure why this is not mentioned in the MTTR section of the HBase book, as obviously MTTR doesn't work at all with this default timeout; at least it doesn't work the way we expected it to.

So thanks again to everyone being spammed with this, and special thanks to Nicolas for pointing me in the right direction.
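For reference, the arithmetic behind the two numbers above: the NameNode declares a DataNode dead after roughly 2 * dfs.namenode.heartbeat.recheck-interval + 10 * dfs.heartbeat.interval. With the stock values (300000 ms and 3 s) that is 630 s, the 10:30 Nicolas mentioned; with the 10 s recheck interval described here it becomes about 50 s, which matches the ~45 s observed. A minimal hdfs-site.xml sketch of that change (values as described in this thread, not a general recommendation):

  <property>
    <name>dfs.heartbeat.interval</name>
    <!-- DataNode heartbeat period, in seconds (3 is the default). -->
    <value>3</value>
  </property>
  <property>
    <name>dfs.namenode.heartbeat.recheck-interval</name>
    <!-- In milliseconds; not listed in the 2.4.1 hdfs-default.xml but still honored.
         Dead-node timeout is roughly 2 * recheck-interval + 10 * heartbeat interval:
         2*10s + 10*3s = 50s with this value, versus 2*300s + 10*3s = 630s with the
         300000 ms default. -->
    <value>10000</value>
  </property>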
On Mon, Mar 23, 2015 at 1:37 PM, Nicolas Liochon <nkeywal@gmail.com> wrote:

The attachments are rejected by the mailing list, can you put them on pastebin?

stale is mandatory (so it's good), but the issue here is just before. The region server needs to read the file. In order to be sure that there is no data loss, it needs to "recover the lease", which means ensuring that nobody is writing the file. The regionserver calls the namenode to do this recoverLease. So there should be some info in the namenode logs. Do you have HDFS-4721 on your hdfs? The hbase book details (more or less...) this recoverLease stuff.

On Mon, Mar 23, 2015 at 10:33 AM, Dejan Menges <dejan.menges@gmail.com> wrote:

And also, just checked - dfs.namenode.avoid.read.stale.datanode and dfs.namenode.avoid.write.stale.datanode are both true, and dfs.namenode.stale.datanode.interval is set to the default 30000.
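The settings just listed, as they would appear in hdfs-site.xml - a sketch of the reported configuration, with 30000 ms being the stock stale interval:

  <property>
    <name>dfs.namenode.avoid.read.stale.datanode</name>
    <value>true</value>
  </property>
  <property>
    <name>dfs.namenode.avoid.write.stale.datanode</name>
    <value>true</value>
  </property>
  <property>
    <name>dfs.namenode.stale.datanode.interval</name>
    <!-- A DataNode that has not heartbeated for this many milliseconds is treated as
         stale and deprioritized for reads and writes, well before it is declared dead. -->
    <value>30000</value>
  </property>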
On Mon, Mar 23, 2015 at 10:03 AM, Dejan Menges <dejan.menges@gmail.com> wrote:

Hi Nicolas,

Please find the log attached.

As I see it now more clearly, it was trying to recover HDFS WALs from the node that's dead:

2015-03-23 08:53:44,381 WARN org.apache.hadoop.hbase.util.FSHDFSUtils: Cannot recoverLease after trying for 900000ms (hbase.lease.recovery.timeout); continuing, but may be DATALOSS!!!; attempt=40 on file=hdfs://{my_hmaster_node}:8020/hbase/WALs/{node_i_intentionally_get_down_by_getting_network_down},60020,1426862900506-splitting/{node_i_intentionally_get_down_by_getting_network_down}%2C60020%2C1426862900506.1427096924508 after 908210ms

And as you can see from the log, it tried 40 times, which took exactly 15 minutes.

There's probably some parameter to tune to cut it down from 40 attempts / 15 minutes to something more useful, as for those 15 minutes we don't have our regions available, and HDFS has replication factor 3 anyway.

I'm googling; if I figure out what this is I will post it here. I will also appreciate it if someone knows how to cut this down.

Thanks,

Dejan

On Fri, Mar 20, 2015 at 3:49 PM, Nicolas Liochon <nkeywal@gmail.com> wrote:

The split is done by the region servers (the master coordinates). Is there some interesting stuff in their logs?

On Fri, Mar 20, 2015 at 3:38 PM, Dejan Menges <dejan.menges@gmail.com> wrote:

With the client, the issue was that it was retrying connections to the same region servers even when the regions were reassigned. Lowering it down helped in this specific use case, but yes, we still want the servers to reallocate quickly.

What got me here: http://hbase.apache.org/book.html#mttr

I basically set the configuration exactly the same way as it's explained here. zookeeper.session.timeout is (and was before) 60000 (one minute).

So basically what happens is:
- I have a test cluster consisting of four nodes, every node being DataNode and RegionServer.
- I simulate a network partition on one (connect to it through the console and take the network interface down - simulating network issues we had recently).
- After a short time I see in HBase that my RegionServer is dead, and as the total number of regions I see the previous total minus the number of regions that were hosted on the node whose RegionServer just 'disappeared'.
- At this point I want my cluster to recover as soon as possible, to start serving the missing regions.
- The first thing I see in the HMaster logs is:

2015-03-20 14:17:26,015 INFO org.apache.hadoop.hbase.zookeeper.RegionServerTracker: RegionServer ephemeral node deleted, processing expiration [{name_of_node_I_took_down},60020,1426860403261]

2015-03-20 14:17:26,067 INFO org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Splitting logs for {name_of_node_I_took_down},60020,1426860403261 before assignment.
2015-03-20 14:17:26,105 INFO org.apache.hadoop.hbase.master.SplitLogManager: dead splitlog workers [{name_of_node_I_took_down},60020,1426860403261]

2015-03-20 14:17:26,107 INFO org.apache.hadoop.hbase.master.SplitLogManager: started splitting 1 logs in [hdfs://{fqdn_of_hmaster}:8020/hbase/WALs/{name_of_node_I_took_down},60020,1426860403261-splitting]

2015-03-20 14:17:26,150 INFO org.apache.hadoop.hbase.master.SplitLogManager: task /hbase/splitWAL/WALs%2F{name_of_node_I_took_down}%2C60020%2C1426860403261-splitting%2F{name_of_node_I_took_down}%252C60020%252C1426860403261.1426860404905 acquired by {fqdn_of_another_live_node},60020,1426859445623

2015-03-20 14:17:26,182 INFO org.apache.hadoop.hbase.master.SplitLogManager: total tasks = 1 unassigned = 0 tasks={/hbase/splitWAL/WALs%2F{name_of_node_I_took_down}%2C60020%2C1426860403261-splitting%2F{name_of_node_I_took_down}%252C60020%252C1426860403261.1426860404905=last_update = 1426861046182 last_version = 2 cur_worker_name = {fqdn_of_another_live_node},60020,1426859445623 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 1 done = 0 error = 0}

2015-03-20 14:17:31,183 INFO org.apache.hadoop.hbase.master.SplitLogManager: total tasks = 1 unassigned = 0 tasks={/hbase/splitWAL/WALs%2F{name_of_node_I_took_down}%2C60020%2C1426860403261-splitting%2F{name_of_node_I_took_down}%252C60020%252C1426860403261.1426860404905=last_update = 1426861046182 last_version = 2 cur_worker_name = {fqdn_of_another_live_node},60020,1426859445623 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 1 done = 0 error = 0}

2015-03-20 14:17:36,184 INFO org.apache.hadoop.hbase.master.SplitLogManager: total tasks = 1 unassigned = 0 tasks={/hbase/splitWAL/WALs%2F{name_of_node_I_took_down}%2C60020%2C1426860403261-splitting%2F{name_of_node_I_took_down}%252C60020%252C1426860403261.1426860404905=last_update = 1426861046182 last_version = 2 cur_worker_name = {fqdn_of_another_live_node},60020,1426859445623 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 1 done = 0 error = 0}

2015-03-20 14:17:42,185 INFO org.apache.hadoop.hbase.master.SplitLogManager: total tasks = 1 unassigned = 0 tasks={/hbase/splitWAL/WALs%2F{name_of_node_I_took_down}%2C60020%2C1426860403261-splitting%2F{name_of_node_I_took_down}%252C60020%252C1426860403261.1426860404905=last_update = 1426861046182 last_version = 2 cur_worker_name = {fqdn_of_another_live_node},60020,1426859445623 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 1 done = 0 error = 0}

2015-03-20 14:17:48,184 INFO org.apache.hadoop.hbase.master.SplitLogManager: total tasks = 1 unassigned = 0 tasks={/hbase/splitWAL/WALs%2F{name_of_node_I_took_down}%2C60020%2C1426860403261-splitting%2F{name_of_node_I_took_down}%252C60020%252C1426860403261.1426860404905=last_update = 1426861046182 last_version = 2 cur_worker_name = {fqdn_of_another_live_node},60020,1426859445623 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 1 done = 0 error = 0}

In the meantime, in the hbase...out log I got this:

==> hbase-hbase-master-{fqdn_of_my_hmaster_node}.out <==

java.io.IOException: Call to {name_of_node_I_took_down}/{ip_of_local_interface_I_took_down}:60020 failed on local exception: org.apache.hadoop.hbase.ipc.RpcClient$CallTimeoutException: Call id=93152, waitTime=60044, rpcTimeout=60000
        at org.apache.hadoop.hbase.ipc.RpcClient.wrapException(RpcClient.java:1532)
        at org.apache.hadoop.hbase.ipc.RpcClient.call(RpcClient.java:1502)
        at org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(RpcClient.java:1684)
        at org.apache.hadoop.hbase.ipc.RpcClient$BlockingRpcChannelImplementation.callBlockingMethod(RpcClient.java:1737)
        at org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$BlockingStub.getRegionInfo(AdminProtos.java:20806)
        at org.apache.hadoop.hbase.client.HBaseAdmin.getCompactionState(HBaseAdmin.java:2524)
        at org.apache.hadoop.hbase.generated.master.table_jsp._jspService(table_jsp.java:167)
        at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:98)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:770)
        at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
        at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
        at org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:109)
        at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
        at org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:1081)
        at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
        at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
        at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
        at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
        at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
        at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
        at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
        at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
        at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
        at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
        at org.mortbay.jetty.Server.handle(Server.java:326)
        at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
        at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
        at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
        at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
        at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
        at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410)
        at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
Caused by: org.apache.hadoop.hbase.ipc.RpcClient$CallTimeoutException: Call id=93152, waitTime=60044, rpcTimeout=60000
        at org.apache.hadoop.hbase.ipc.RpcClient$Connection.cleanupCalls(RpcClient.java:1234)
        at org.apache.hadoop.hbase.ipc.RpcClient$Connection.readResponse(RpcClient.java:1171)
        at org.apache.hadoop.hbase.ipc.RpcClient$Connection.run(RpcClient.java:751)

Besides this same issue, please note that the first message was at 2015-03-20 14:17:26,015. And then (we got to the point when it started the transition):

2015-03-20 14:32:35,059 INFO org.apache.hadoop.hbase.master.SplitLogManager: task /hbase/splitWAL/WALs%2F{name_of_node_I_took_down}%2C60020%2C1426860403261-splitting%2F{name_of_node_I_took_down}%252C60020%252C1426860403261.1426860404905 entered state: DONE {fqdn_of_new_live_node},60020,1426859445623

2015-03-20 14:32:35,109 INFO org.apache.hadoop.hbase.master.SplitLogManager: Done splitting /hbase/splitWAL/WALs%2F{name_of_node_I_took_down}%2C60020%2C1426860403261-splitting%2F{name_of_node_I_took_down}%252C60020%252C1426860403261.1426860404905

2015-03-20 14:32:35,190 INFO org.apache.hadoop.hbase.master.SplitLogManager: finished splitting (more than or equal to) 9 bytes in 1 log files in [hdfs://{fqdn_of_my_hmaster_node}:8020/hbase/WALs/{name_of_node_I_took_down},60020,1426860403261-splitting] in 909083ms

2015-03-20 14:32:35,191 INFO org.apache.hadoop.hbase.master.RegionStates: Transitioned {0e7cc87a4ef6c47a779186f5bf79a01c state=OPEN, ts=1426860639088, server={name_of_node_I_took_down},60020,1426860403261} to {0e7cc87a4ef6c47a779186f5bf79a01c state=OFFLINE, ts=1426861955191, server={name_of_node_I_took_down},60020,1426860403261}

2015-03-20 14:32:35,191 INFO org.apache.hadoop.hbase.master.RegionStates: Offlined 0e7cc87a4ef6c47a779186f5bf79a01c from {name_of_node_I_took_down},60020,1426860403261

2015-03-20 14:32:35,191 INFO org.apache.hadoop.hbase.master.RegionStates: Transitioned {25ab6e7b42e36ddaa723d71be5954543 state=OPEN, ts=1426860641783, server={name_of_node_I_took_down},60020,1426860403261} to {25ab6e7b42e36ddaa723d71be5954543 state=OFFLINE, ts=1426861955191, server={name_of_node_I_took_down},60020,1426860403261}

2015-03-20 14:32:35,191 INFO org.apache.hadoop.hbase.master.RegionStates: Offlined 25ab6e7b42e36ddaa723d71be5954543 from {name_of_node_I_took_down},60020,1426860403261

At this point, note that it finished the SplitLogManager task at 14:32:35 and started transitioning regions just after that. So this is the 15 minutes I'm talking about.

What am I missing?
On Fri, Mar 20, 2015 at 2:37 PM, Nicolas Liochon <nkeywal@gmail.com> wrote:

You've changed the value of hbase.zookeeper.timeout to 15 minutes? A very reasonable target is 1 minute before relocating the regions. That's the default iirc. You can push it to 20s, but then gc-stopping-the-world becomes more of an issue. 15 minutes is really a lot. The hdfs stale mode must always be used, with a lower timeout than the hbase one.

Client side there should be nothing to do (except for advanced stuff); at each retry the client checks the location of the regions. If you lower the number of retries the client will fail sooner, but usually you don't want the client to fail, you want the servers to reallocate quickly.
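To make that pairing concrete, a minimal hbase-site.xml sketch; 60000 ms is the value reported earlier in this thread, and the HDFS stale-datanode interval is expected to sit below it:

  <property>
    <name>zookeeper.session.timeout</name>
    <!-- How long ZooKeeper waits before expiring a silent region server's session, which is
         what triggers reassignment of its regions. Lower values detect failures faster, but
         a long GC pause can then expire the session of an otherwise healthy server. -->
    <value>60000</value>
  </property>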
On Fri, Mar 20, 2015 at 1:36 PM, Dejan Menges <dejan.menges@gmail.com> wrote:

Hi,

Sorry for the slightly late update, but I managed to narrow it down a little bit.

We didn't update yet, as we are using the Hortonworks distribution right now, and even if we update we will get 0.98.4. However, it looks like the issue here was in our use case and configuration (still looking into it).

Basically, initially I saw that when one server goes down, we start having performance issues in general, but it turned out to be on our client side, due to caching: clients were trying to reconnect to nodes that were offline and later trying to get regions from those nodes too. This is basically why, on the server side, I didn't manage to see anything in the logs that would be at least a little bit interesting and point me in the desired direction.

Another question that popped up for me is: in case a server is down (and with it the DataNode and HRegionServer it was hosting), what's the optimal time to set for HMaster to consider the server dead and reassign its regions somewhere else, as this is another performance bottleneck we hit while unable to access regions? In our case it's configured to 15 minutes, and simple logic tells me that if you want it sooner you configure a lower number of retries, but the devil is as always in the details, so I'm not sure if anyone knows some better math for this?

And a last question - is it possible to manually force HBase to reassign regions? In this case, while HMaster is retrying to contact the node that's dead, it's impossible to force it using the 'balancer' command.

Thanks a lot!

Dejan

On Tue, Mar 17, 2015 at 9:37 AM, Dejan Menges <dejan.menges@gmail.com> wrote:

Hi,

To be very honest - there's no particular reason why we stick to this one, besides just a lack of time currently to go through the upgrade process, but it looks to me like that's going to be the next step.

Had a crazy day, didn't have time to go through all the logs again, plus one of the machines (the last one where we had this issue) was fully reprovisioned yesterday, so I don't have logs from there anymore.
Besides upgrading, which I will bring up today, can you just point me to the specific RPC issue in 0.98.0? The thing is that we have some strange moments with RPC in this case, and I just want to see if it's the same thing (we were even suspecting RPC).

Thanks a lot!
Dejan

On Mon, Mar 16, 2015 at 9:32 PM, Andrew Purtell <apurtell@apache.org> wrote:

Is there a particular reason why you are using HBase 0.98.0? The latest 0.98 release is 0.98.11. There's a known performance issue with 0.98.0 pertaining to RPC that was fixed in later releases; you should move up from 0.98.0. In addition, hundreds of improvements and bug fixes have gone into the ten releases since 0.98.0.

--
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)

On Mon, Mar 16, 2015 at 6:40 AM, Dejan Menges <dejan.menges@gmail.com> wrote:

Hi All,

We have a strange issue with HBase performance (overall cluster performance) in case one of the datanodes in the cluster unexpectedly goes down.

So the scenario is as follows:
- Cluster works fine, it's stable.
- One DataNode unexpectedly goes down (PSU issue, network issue, anything).
- The whole HBase cluster goes down (performance becomes so bad that we have to restart all RegionServers to get it back to life).
The funniest and most recent issue that happened was that we added a new node to the cluster (with 8 x 4T SATA disks) and left just the DataNode running on it, to give it a couple of days to accumulate some data. At some point, due to a hardware issue, the server rebooted (twice during three hours) at a moment when it had maybe 5% of the data it would have in a couple of days. Nothing else besides the DataNode was running on it, and once it went down, it affected literally everything, and restarting the RegionServers in the end fixed it.

We are using HBase 0.98.0 with Hadoop 2.4.0.