hbase-user mailing list archives

From Nicolas Liochon <nkey...@gmail.com>
Subject Re: Slow region server recoveries
Date Sat, 20 Apr 2013 08:16:24 GMT
Hi,

I looked at it again with a fresh eye. As Varun was saying, the root cause
is the wrong order of the block locations.

The root cause of the root cause is actually simple: HBase started the
recovery while the node was not yet stale from an HDFS pov.

Varun mentioned this timing:
Lost Beat: 27:30
Became stale: 27:50 - * this is a guess and reverse engineered (stale timeout 20 seconds)
Became dead: 37:51

But the recovery started at 27:13 (15 seconds before we have this log line):
2013-04-19 00:27:28,432 WARN org.apache.hadoop.hdfs.DFSClient: Failed to connect to /10.156.194.94:50010 for file /hbase/feeds/1479495ad2a02dceb41f093ebc29fe4f/home/02f639bb43944d4ba9abcf58287831c0 for block BP-696828882-10.168.7.226-1364886167971:blk_-5977178030490858298_99853:java.net.SocketTimeoutException: 15000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=/10.156.194.94:50010]

So when we took the block locations from the NN, the datanode was not yet
stale, so we got them in the wrong (random) order.
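To make this concrete, here is a minimal sketch of the ordering logic as I
understand HDFS-3703; the class and interface names below are hypothetical,
not the actual HDFS code. The NN only demotes replicas it already considers
stale, so a datanode that stopped answering 15 seconds ago but has not yet
crossed the 20 second stale interval can still be returned first:

import java.util.Arrays;
import java.util.Comparator;

final class StaleAwareOrder {
  // Stand-in for the NN's datanode descriptor (hypothetical).
  interface DatanodeStatus {
    boolean isStale();
  }

  // Reorder block locations in place so that stale datanodes go last.
  // Arrays.sort is stable, so non-stale replicas keep their original order.
  static void sortLocations(DatanodeStatus[] locations) {
    Arrays.sort(locations, new Comparator<DatanodeStatus>() {
      @Override
      public int compare(DatanodeStatus a, DatanodeStatus b) {
        return Boolean.compare(a.isStale(), b.isStale());
      }
    });
  }
}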

ZooKeeper can expire a session before the timeout. I don't know why it does
this in this case, but I don't consider it a ZK bug: if ZK knows that a node
is dead, it's its role to expire the session. There is something more fishy:
we started the recovery while the datanode was still responding to
heartbeats. I don't know why. Maybe the OS was able to kill -15 the RS
before vanishing away.

Anyway, we then have an exception when we try to connect, because the RS
does not have a TCP connection to this datanode. And this is retried many
times.

You would not have this with trunk, because HBASE-6435 reorders the block
locations inside the client, using information not available to the NN, so
that the datanode of the region server under recovery is not tried first.
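Roughly, the client-side idea looks like the sketch below (hypothetical
names, not the actual HBASE-6435 patch): the path of a WAL under
/hbase/.logs/<servername>,...-splitting tells the reader which server just
died, so the replica hosted on that machine can be pushed to the end of the
location list before the read is attempted:

import java.util.ArrayList;
import java.util.List;

final class WalBlockReorder {
  // Move the replica colocated with the dying region server to the end of
  // the list; the healthy replicas keep their relative order.
  static List<String> deprioritize(List<String> replicaHosts, String dyingHost) {
    List<String> healthy = new ArrayList<String>();
    List<String> suspect = new ArrayList<String>();
    for (String host : replicaHosts) {
      if (host.equals(dyingHost)) {
        suspect.add(host);
      } else {
        healthy.add(host);
      }
    }
    healthy.addAll(suspect);
    return healthy;
  }
}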

Some conclusions:
 - we should likely backport hbase-6435 to 0.94.
 - I will revive HDFS-3706 and HDFS-3705 (the non hacky way to get
hbase-6435).
 - There is some stuff that could be better in HDFS. I will look into it.
 - I'm worried by the SocketTimeoutException. We should get a NoRouteToHost
at some point, and we don't. That's also why it takes ages. I think it's an
AWS thing, but it brings two issues: it's slow, and, in HBase, you don't
know if the operation could have been executed or not, so it adds complexity
to some scenarios. If someone with enough network and AWS knowledge could
clarify this point, it would be great.

 Cheers,

 Nicolas

On Fri, Apr 19, 2013 at 10:10 PM, Varun Sharma <varun@pinterest.com> wrote:

> This is 0.94.3 hbase...
>
>
> On Fri, Apr 19, 2013 at 1:09 PM, Varun Sharma <varun@pinterest.com> wrote:
>
> > Hi Ted,
> >
> > I had a long offline discussion with Nicolas on this. Looks like the last
> > block, which was still being written to, took an enormous time to recover.
> > Here's what happened.
> > a) Master split tasks and region servers process them
> > b) Region server tries to recover lease for each WAL log - most cases are
> > noop since they are already rolled over/finalized (see the sketch below)
> > c) The last file lease recovery takes some time since the crashing server
> > was writing to it and had a lease on it - but basically we have the lease
> > 1 minute after the server was lost
> > d) Now we start the recovery for this but we end up hitting the stale
> > datanode, which is puzzling.
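For reference, steps b/c boil down to something like the sketch below, using
the public DistributedFileSystem API; this is a simplification, not the
actual 0.94 splitting code:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

final class WalLeaseRecovery {
  // Already-rolled WALs return true on the first call; the log the dead RS
  // was still writing keeps returning false until the NN finishes recovery
  // of its last (UNDER_RECOVERY) block.
  static void recoverLease(DistributedFileSystem dfs, Path wal)
      throws Exception {
    while (!dfs.recoverLease(wal)) {
      Thread.sleep(1000);   // poll until the NN reports the file as closed
    }
  }
}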
> >
> > It seems that we did not hit the stale datanode when we were trying to
> > recover the finalized WAL blocks with trivial lease recovery. However, for
> > the final block, we hit the stale datanode. Any clue why this might be
> > happening?
> >
> > Varun
> >
> >
> > On Fri, Apr 19, 2013 at 10:40 AM, Ted Yu <yuzhihong@gmail.com> wrote:
> >
> >> Can you show a snippet from the DN log which mentions UNDER_RECOVERY?
> >>
> >> Here is the criterion for stale node checking to kick in (from
> >> https://issues.apache.org/jira/secure/attachment/12544897/HDFS-3703-trunk-read-only.patch):
> >>
> >> +   * Check if the datanode is in stale state. Here if
> >> +   * the namenode has not received heartbeat msg from a
> >> +   * datanode for more than staleInterval (default value is
> >> +   * {@link DFSConfigKeys#DFS_NAMENODE_STALE_DATANODE_INTERVAL_MILLI_DEFAULT}),
> >> +   * the datanode will be treated as stale node.
> >>
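Paraphrased as code (hypothetical names, not the actual patch), with the
timeline from this thread in the comments:

final class StaleCheck {
  // A datanode only becomes stale once its last heartbeat is older than the
  // configured stale interval (20 s here). With the last beat at 27:30, the
  // node is not stale before ~27:50, while recovery of the WAL block was
  // already being attempted around 27:28.
  static boolean isStale(long lastHeartbeatMillis, long nowMillis,
                         long staleIntervalMillis) {
    return nowMillis - lastHeartbeatMillis > staleIntervalMillis;
  }
}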
> >>
> >> On Fri, Apr 19, 2013 at 10:28 AM, Varun Sharma <varun@pinterest.com> wrote:
> >>
> >> > Is there a place to upload these logs ?
> >> >
> >> >
> >> > On Fri, Apr 19, 2013 at 10:25 AM, Varun Sharma <varun@pinterest.com> wrote:
> >> >
> >> > > Hi Nicholas,
> >> > >
> >> > > Attached are the namenode, dn logs (of one of the healthy replicas of
> >> > > the WAL block) and the rs logs which got stuck doing the log split.
> >> > > Action begins at 2013-04-19 00:27*.
> >> > >
> >> > > Also, the rogue block is 5723958680970112840_174056. It's very
> >> > > interesting to trace this guy through the HDFS logs (dn and nn).
> >> > >
> >> > > Btw, do you know what the UNDER_RECOVERY stage is for, in HDFS? Also,
> >> > > does the stale node stuff kick in for that state?
> >> > >
> >> > > Thanks
> >> > > Varun
> >> > >
> >> > >
> >> > > On Fri, Apr 19, 2013 at 4:00 AM, Nicolas Liochon <nkeywal@gmail.com> wrote:
> >> > >
> >> > >> Thanks for the detailed scenario and analysis. I'm going to have a
> >> > >> look.
> >> > >> I can't access the logs (ec2-107-20-237-30.compute-1.amazonaws.com
> >> > >> timeouts), could you please send them directly to me?
> >> > >>
> >> > >> Thanks,
> >> > >>
> >> > >> Nicolas
> >> > >>
> >> > >>
> >> > >> On Fri, Apr 19, 2013 at 12:46 PM, Varun Sharma <varun@pinterest.com> wrote:
> >> > >>
> >> > >> > Hi Nicholas,
> >> > >> >
> >> > >> > Here is the failure scenario, I have dug up the logs.
> >> > >> >
> >> > >> > A machine fails and stops accepting/transmitting traffic. The
> >> > >> > HMaster starts the distributed split for 13 tasks. There are 12
> >> > >> > region servers. 12 tasks succeed but the 13th one takes a looong
> >> > >> > time.
> >> > >> >
> >> > >> > Zookeeper timeout is set to 30 seconds. Stale node timeout is 20
> >> > >> > seconds. Both patches are there.
> >> > >> >
> >> > >> > a) Machine fails around 27:30
> >> > >> > b) Master starts the split around 27:40 and submits the tasks. The
> >> > >> > one task which fails seems to be the one which contains the WAL
> >> > >> > being currently written to:
> >> > >> >
> >> > >> > 2013-04-19 00:27:44,325 INFO org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Splitting hlog: hdfs://ec2-107-20-237-30.compute-1.amazonaws.com/hbase/.logs/ip-10-156-194-94.ec2.internal,60020,1366323217601-splitting/ip-10-156-194-94.ec2.internal%2C60020%2C1366323217601.1366331156141, length=0
> >> > >> >
> >> > >> > Basically this region server picks up the task but finds the
> >> > >> > length of this file to be 0 and drops it. This happens again.
> >> > >> >
> >> > >> > c) Finally another region server picks up the task but it ends up
> >> > >> > going to the bad datanode (which should not happen because of the
> >> > >> > stale node timeout). Unfortunately it hits the 45 retries and a
> >> > >> > connect timeout of 20 seconds every time. This delays recovery
> >> > >> > significantly. Now I guess reducing # of retries to 1 is one
> >> > >> > possibility.
> >> > >> > But then the namenode logs are very interesting.
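(For scale: if each of those 45 retries burns the full 20-second connect
timeout, that alone is 45 × 20 s = 900 s, i.e. roughly 15 minutes on a single
bad replica, which lines up with the ~20 minute recoveries reported at the
start of the thread.)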
> >> > >> >
> >> > >> > d) Namenode seems to be in a cyclic lease recovery loop until the
> >> > >> > node is marked dead. There is this one last block which exhibits
> >> > >> > this.
> >> > >> >
> >> > >> > 2013-04-19 00:28:09,744 INFO BlockStateChange: BLOCK* blk_-*5723958680970112840_174056*{blockUCState=UNDER_RECOVERY, primaryNodeIndex=1, replicas=[ReplicaUnderConstruction[10.156.194.94:50010|RBW], ReplicaUnderConstruction[10.156.192.106:50010|RBW], ReplicaUnderConstruction[10.156.195.38:50010|RBW]]} recovery started, primary=10.156.192.106:50010
> >> > >> > 2013-04-19 00:28:09,744 WARN org.apache.hadoop.hdfs.StateChange: DIR* NameSystem.internalReleaseLease: File /hbase/.logs/ip-10-156-194-94.ec2.internal,60020,1366323217601-splitting/ip-10-156-194-94.ec2.internal%2C60020%2C1366323217601.1366331156141 has not been closed. Lease recovery is in progress. RecoveryId = 174413 for block blk_-5723958680970112840_174056{blockUCState=UNDER_RECOVERY, primaryNodeIndex=1, replicas=[ReplicaUnderConstruction[10.156.194.94:50010|RBW], ReplicaUnderConstruction[10.156.192.106:50010|RBW], ReplicaUnderConstruction[10.156.195.38:50010|RBW]]}
> >> > >> >
> >> > >> > I see this over and over again in the logs until the datanode is
> >> > >> > marked dead. It seems to be cycling through the replicas for this
> >> > >> > WAL block and trying to add it to the recovery list. I looked at
> >> > >> > the code and it says:
> >> > >> >
> >> > >> >       // Cannot close file right now, since the last block requires recovery.
> >> > >> >       // This may potentially cause infinite loop in lease recovery
> >> > >> >       // if there are no valid replicas on data-nodes.
> >> > >> >       NameNode.stateChangeLog.warn(
> >> > >> >                 "DIR* NameSystem.internalReleaseLease: " +
> >> > >> >                 "File " + src + " has not been closed." +
> >> > >> >                 " Lease recovery is in progress. " +
> >> > >> >                 "RecoveryId = " + blockRecoveryId + " for block " + lastBlock);
> >> > >> >       break;
> >> > >> >
> >> > >> > Eventually for this block, we get
> >> > >> >
> >> > >> > 2013-04-19 00:41:20,736 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: commitBlockSynchronization(lastblock=BP-696828882-10.168.7.226-1364886167971:blk_-*5723958680970112840_174056*, newgenerationstamp=174413, newlength=119148648, newtargets=[10.156.192.106:50010, 10.156.195.38:50010], closeFile=true, deleteBlock=false)
> >> > >> > 2013-04-19 00:41:20,736 ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:hdfs (auth:SIMPLE) cause:java.io.IOException: Block (=BP-696828882-10.168.7.226-1364886167971:blk_-5723958680970112840_174056) not found
> >> > >> > 2013-04-19 00:41:20,736 INFO org.apache.hadoop.ipc.Server: IPC Server handler 35 on 8020, call org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.commitBlockSynchronization from 10.156.192.106:53271: error: java.io.IOException: Block (=BP-696828882-10.168.7.226-1364886167971:blk_-5723958680970112840_174056) not found
> >> > >> >
> >> > >> > On the datanode side, I see a call to recoverBlock - I see that a
> >> > >> > write pipeline is there, which gets terminated with some socket
> >> > >> > timeouts...
> >> > >> >
> >> > >> > 00:28:11,471 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: NameNode at ec2-107-20-237-30.compute-1.amazonaws.com/10.168.7.226:8020 calls recoverBlock(BP-696828882-10.168.7.226-1364886167971:blk_-5723958680970112840_174056, targets=[10.156.194.94:50010, 10.156.192.106:50010, 10.156.195.38:50010], newGenerationStamp=174413)
> >> > >> >
> >> > >> > Not sure, but this looks like a case where data could be lost?
> >> > >> >
> >> > >> > Varun
> >> > >> >
> >> > >> >
> >> > >> > On Fri, Apr 19, 2013 at 12:38 AM, Nicolas Liochon <nkeywal@gmail.com> wrote:
> >> > >> >
> >> > >> > > Hey Varun,
> >> > >> > >
> >> > >> > > Could you please share the logs and the configuration (hdfs /
> >> > >> > > hbase settings + cluster description). What's the failure
> >> > >> > > scenario?
> >> > >> > > From an HDFS pov, HDFS 3703 does not change the dead node status.
> >> > >> > > But these nodes will be given the lowest priority when reading.
> >> > >> > >
> >> > >> > >
> >> > >> > > Cheers,
> >> > >> > >
> >> > >> > > Nicolas
> >> > >> > >
> >> > >> > >
> >> > >> > > On Fri, Apr 19, 2013 at 3:01 AM, Varun Sharma <varun@pinterest.com> wrote:
> >> > >> > >
> >> > >> > > > Hi,
> >> > >> > > >
> >> > >> > > > We are facing problems with really slow HBase region server
> >> > >> > > > recoveries, ~20 minutes. Version is hbase 0.94.3 compiled with
> >> > >> > > > hadoop.profile=2.0.
> >> > >> > > >
> >> > >> > > > Hadoop version is CDH 4.2 with HDFS 3703 and HDFS 3912 patched
> >> > >> > > > and stale node timeouts configured correctly. Time for dead
> >> > >> > > > node detection is still 10 minutes.
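For context, the stale-node settings this presumably refers to are sketched
below via the Hadoop Configuration API; the key names reflect my reading of
the HDFS-3703 / HDFS-3912 patches, so please double-check them against your
build:

import org.apache.hadoop.conf.Configuration;

final class StaleNodeSettings {
  static Configuration staleAwareConf() {
    Configuration conf = new Configuration();
    // HDFS-3703: reads avoid datanodes whose last heartbeat is older than
    // the stale interval (20 s in this cluster).
    conf.setBoolean("dfs.namenode.avoid.read.stale.datanode", true);
    conf.setLong("dfs.namenode.stale.datanode.interval", 20 * 1000L);
    // HDFS-3912: writes avoid stale datanodes as well.
    conf.setBoolean("dfs.namenode.avoid.write.stale.datanode", true);
    return conf;
  }
}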
> >> > >> > > >
> >> > >> > > > We see that our region server is trying to read an HLog and is
> >> > >> > > > stuck there for a long time. Logs here:
> >> > >> > > >
> >> > >> > > > 2013-04-12 21:14:30,248 WARN org.apache.hadoop.hdfs.DFSClient: Failed to connect to /10.156.194.251:50010 for file /hbase/feeds/fbe25f94ed4fa37fb0781e4a8efae142/home/1d102c5238874a5d82adbcc09bf06599 for block BP-696828882-10.168.7.226-1364886167971:blk_-3289968688911401881_9428:java.net.SocketTimeoutException: 15000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.156.192.173:52818 remote=/10.156.194.251:50010]
> >> > >> > > >
> >> > >> > > > I would think that HDFS 3703 would make the server fail fast
> >> > >> > > > and go to the third datanode. Currently, the recovery seems way
> >> > >> > > > too slow for production usage...
> >> > >> > > >
> >> > >> > > > Varun
> >> > >> > > >
> >> > >> > >
> >> > >> >
> >> > >>
> >> > >
> >> > >
> >> >
> >>
> >
> >
>
