hbase-issues mailing list archives

From "Jonathan Gray (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HBASE-3380) Master failover can split logs of live servers
Date Tue, 21 Dec 2010 20:19:01 GMT

    [ https://issues.apache.org/jira/browse/HBASE-3380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12973861#action_12973861 ]

Jonathan Gray commented on HBASE-3380:
--------------------------------------

All great ideas.  I've been wanting to punt this to 0.92 and do it right there.

I think it will be sufficient for 0.90 and this jira to use my patch but change the timeout
to 4500ms and the interval to 1500ms.  That's only a 50% increase in waiting (and because of
how interval and timeout interact, you might actually wait less).  And I like these new
configs if you aren't against them.

So I propose my patch with the defaults changed to 1500ms (interval) / 4500ms (timeout).
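
To illustrate why an interval plus a timeout can end up waiting less than the timeout alone, here is a minimal sketch. It is not the code from HBASE-3380-v1.patch; it assumes the interval means "stop once no new region server has checked in for this long" and the timeout is a hard cap on the total wait. The class name, method name, and the IntSupplier parameter are hypothetical, made up for this example.

```java
import java.util.function.IntSupplier;

// Hypothetical sketch, not HBase code: wait for region server check-ins
// using a quiet-period "interval" and an overall "timeout".
final class CheckinWait {

    /**
     * Polls checkedIn until the count has not changed for intervalMs,
     * or until timeoutMs has elapsed in total, whichever comes first.
     * Returns the last observed count.
     */
    public static int waitForCheckins(IntSupplier checkedIn, long intervalMs, long timeoutMs) {
        // Poll at a fraction of the interval, bounded to 1..100 ms (an arbitrary choice).
        final long pollMs = Math.max(1, Math.min(100, intervalMs / 10));
        final long start = System.nanoTime();
        long lastChange = start;
        int count = checkedIn.getAsInt();
        while (true) {
            long now = System.nanoTime();
            long totalMs = (now - start) / 1_000_000;
            long quietMs = (now - lastChange) / 1_000_000;
            if (totalMs >= timeoutMs || quietMs >= intervalMs) {
                return count;
            }
            try {
                Thread.sleep(pollMs);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and give up waiting.
                Thread.currentThread().interrupt();
                return count;
            }
            int latest = checkedIn.getAsInt();
            if (latest != count) {
                // A server checked in (or left): restart the quiet period.
                count = latest;
                lastChange = System.nanoTime();
            }
        }
    }

    public static void main(String[] args) {
        // With a count that never changes, the wait ends after roughly one
        // interval (1500ms) instead of the full timeout (4500ms).
        int n = waitForCheckins(() -> 3, 1500, 4500);
        System.out.println("proceeding with " + n + " region servers");
    }
}
```

Under these assumptions, the full 4500ms is only spent when servers keep trickling in; once check-ins go quiet for 1500ms the caller proceeds.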

> Master failover can split logs of live servers
> ----------------------------------------------
>
>                 Key: HBASE-3380
>                 URL: https://issues.apache.org/jira/browse/HBASE-3380
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Jean-Daniel Cryans
>            Priority: Blocker
>             Fix For: 0.90.0
>
>         Attachments: HBASE-3380-v1.patch
>
>
> The reason TestMasterFailover fails is that, when it does the master failover, the
new master doesn't wait long enough for all region servers to check in, so it goes ahead and
splits logs... which doesn't work because of the way lease timeouts work:
> {noformat}
> 2010-12-21 07:30:36,977 DEBUG [Master:0;vesta.apache.org:33170] wal.HLogSplitter(256):
Splitting hlog 1 of 1:
>  hdfs://localhost:49187/user/hudson/.logs/vesta.apache.org,38743,1292916616340/vesta.apache.org%3A38743.1292916617204,
length=0
> 2010-12-21 07:30:36,977 DEBUG [WriterThread-1] wal.HLogSplitter$WriterThread(619): Writer
thread Thread[WriterThread-1,5,main]: starting
> 2010-12-21 07:30:36,977 DEBUG [WriterThread-2] wal.HLogSplitter$WriterThread(619): Writer
thread Thread[WriterThread-2,5,main]: starting
> 2010-12-21 07:30:36,977 INFO  [Master:0;vesta.apache.org:33170] util.FSUtils(625): Recovering
file
>  hdfs://localhost:49187/user/hudson/.logs/vesta.apache.org,38743,1292916616340/vesta.apache.org%3A38743.1292916617204
> 2010-12-21 07:30:36,979 WARN  [IPC Server handler 8 on 49187] namenode.FSNamesystem(1122):
DIR* NameSystem.startFile:
>  failed to create file /user/hudson/.logs/vesta.apache.org,38743,1292916616340/vesta.apache.org%3A38743.1292916617204
for
>  DFSClient_hb_m_vesta.apache.org:33170_1292916630791 on client 127.0.0.1, because this
file is already being created by
>  DFSClient_hb_rs_vesta.apache.org,38743,1292916616340_1292916617166 on 127.0.0.1
> ...
> 2010-12-21 07:33:44,332 WARN  [Master:0;vesta.apache.org:33170] util.FSUtils(644): Waited
187354ms for lease recovery on
>  hdfs://localhost:49187/user/hudson/.logs/vesta.apache.org,38743,1292916616340/vesta.apache.org%3A38743.1292916617204:
>  org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create file
>  /user/hudson/.logs/vesta.apache.org,38743,1292916616340/vesta.apache.org%3A38743.1292916617204
>  for DFSClient_hb_m_vesta.apache.org:33170_1292916630791 on client 127.0.0.1, because
this file is already
>  being created by DFSClient_hb_rs_vesta.apache.org,38743,1292916616340_1292916617166
on 127.0.0.1
> {noformat}
> I think we should always check in ZK for the number of live region servers before waiting
for them to check in; that way we know how many to expect during failover. We still
need a timeout, since an RS can die during that time, but we should
wait a bit longer.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

