hbase-dev mailing list archives

From "Kannan Muthukkaruppan (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HBASE-2235) Mechanism that would not have -ROOT- and .META. on same server caused failed assign of .META.
Date Fri, 19 Feb 2010 21:54:28 GMT

    [ https://issues.apache.org/jira/browse/HBASE-2235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12835979#action_12835979 ]

Kannan Muthukkaruppan commented on HBASE-2235:
----------------------------------------------

I managed to get the .META. table into an inconsistent state again on my small test cluster
under load. The region server went down because of errors from the HDFS layer, which we are
following up on separately (probably just too much compaction and other activity going on at
the same time).

I know I can run add_table to restore its sanity. But we have now gotten .META. into an
inconsistent state often enough that it might make sense to do something about it in the
0.20.x timeframe (either make .META. updates atomic, or perhaps have the meta scanner fix
broken children).

So, roughly here is what happened today.

(i) A RS got a lot of org.apache.hadoop.hdfs.server.namenode.NotReplicatedYetException errors
followed by:

{code}
2010-02-19 08:49:07,102 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_9144926768183088527_186431
bad datanode[0] nodes == null
2010-02-19 08:49:07,102 WARN org.apache.hadoop.hdfs.DFSClient: Could not get block locations.
Source file "/hbase-kannan1/test1/580635726/actions/133921297\
0969249937" - Aborting...
2010-02-19 08:49:07,117 FATAL org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Replay
of hlog required. Forcing server shutdow
{code}

(ii) During shutdown there were other errors like:

{code}
2010-02-19 08:51:07,557 ERROR org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction/Split
failed for region test1,1761194,1266576717079
java.io.IOException: Filesystem closed
2010-02-19 08:51:07,660 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: Shutting
down HRegionServer: file system not available
java.io.IOException: File system is not available
...

2010-02-19 08:50:07,321 WARN org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Tried to
hold up flushing for compactions of region test1,1761194,126657\
6717079 but have waited longer than 90000ms, continuing
2010-02-19 08:50:07,322 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: NOT flushing memstore
for region test1,1761194,1266576717079, flushing=false, w\
ritesEnabled=false
2010-02-19 08:50:07,348 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server Responder, call
put([B@39804c99, [Lorg.apache.hadoop.hbase.client.Put;@1624ee4d)\
 from 10.131.1.186:36796: output error
2010-02-19 08:50:07,354 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server Responder, call
put([B@5f3034b2, [Lorg.apache.hadoop.hbase.client.Put;@55d3c2f0)\
 from 10.131.1.186:36796: output error
2010-02-19 08:50:07,354 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 82 on 60020
caught: java.nio.channels.ClosedChannelException
        at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:126)
        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
        at org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1125)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:615)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:679)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:943)
{code}

After all this, I restarted the RS, but several regions seem to be in an odd state in .META.
For example, for one particular start key, I see all of these entries:
{code}
test1,1204765,1266569946560 column=info:regioninfo, timestamp=1266581302018, value=REGION
=> {NAME => 'test1,
                             1204765,1266569946560', STARTKEY => '1204765', ENDKEY =>
'1441091', ENCODED => 18
                             19368969, OFFLINE => true, SPLIT => true, TABLE => {{NAME
=> 'test1', FAMILIES =>
                              [{NAME => 'actions', VERSIONS => '3', COMPRESSION =>
'NONE', TTL => '2147483647'
                             , BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE
=> 'true'}]}}
 test1,1204765,1266569946560 column=info:server, timestamp=1266570029133, value=10.129.68.212:60020
 test1,1204765,1266569946560 column=info:serverstartcode, timestamp=1266570029133, value=1266562597546
 test1,1204765,1266569946560 column=info:splitB, timestamp=1266581302018, value=\x00\x071441091\x00\x00\x00\x0
                             1\x26\xE6\x1F\xDF\x27\x1Btest1,1290703,1266581233447\x00\x071290703\x00\x00\x00\x
                             05\x05test1\x00\x00\x00\x00\x00\x02\x00\x00\x00\x07IS_ROOT\x00\x00\x00\x05false\x
                             00\x00\x00\x07IS_META\x00\x00\x00\x05false\x00\x00\x00\x01\x07\x07actions\x00\x00
                             \x00\x07\x00\x00\x00\x0BBLOOMFILTER\x00\x00\x00\x05false\x00\x00\x00\x0BCOMPRESSI
                             ON\x00\x00\x00\x04NONE\x00\x00\x00\x08VERSIONS\x00\x00\x00\x013\x00\x00\x00\x03TT
                             L\x00\x00\x00\x0A2147483647\x00\x00\x00\x09BLOCKSIZE\x00\x00\x00\x0565536\x00\x00
                             \x00\x09IN_MEMORY\x00\x00\x00\x05false\x00\x00\x00\x0ABLOCKCACHE\x00\x00\x00\x04t
                             rueh\x0FQ\xCF
 test1,1204765,1266581233447 column=info:regioninfo, timestamp=1266609172177, value=REGION
=> {NAME => 'test1,
                             1204765,1266581233447', STARTKEY => '1204765', ENDKEY =>
'1290703', ENCODED => 13
                             73493090, OFFLINE => true, SPLIT => true, TABLE => {{NAME
=> 'test1', FAMILIES =>
                              [{NAME => 'actions', VERSIONS => '3', COMPRESSION =>
'NONE', TTL => '2147483647'
                             , BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE
=> 'true'}]}}
 test1,1204765,1266581233447 column=info:server, timestamp=1266604768670, value=10.129.68.213:60020
 test1,1204765,1266581233447 column=info:serverstartcode, timestamp=1266604768670, value=1266562597511
 test1,1204765,1266581233447 column=info:splitA, timestamp=1266609172177, value=\x00\x071226169\x00\x00\x00\x0
                             1\x26\xE7\xCA,\x7D\x1Btest1,1204765,1266609171581\x00\x071204765\x00\x00\x00\x05\
                             x05test1\x00\x00\x00\x00\x00\x02\x00\x00\x00\x07IS_ROOT\x00\x00\x00\x05false\x00\
                             x00\x00\x07IS_META\x00\x00\x00\x05false\x00\x00\x00\x01\x07\x07actions\x00\x00\x0
                             0\x07\x00\x00\x00\x0BBLOOMFILTER\x00\x00\x00\x05false\x00\x00\x00\x0BCOMPRESSION\
                             x00\x00\x00\x04NONE\x00\x00\x00\x08VERSIONS\x00\x00\x00\x013\x00\x00\x00\x03TTL\x
                             00\x00\x00\x0A2147483647\x00\x00\x00\x09BLOCKSIZE\x00\x00\x00\x0565536\x00\x00\x0
                             0\x09IN_MEMORY\x00\x00\x00\x05false\x00\x00\x00\x0ABLOCKCACHE\x00\x00\x00\x04true
                             \xB9\xBD\xFEO
 test1,1204765,1266581233447 column=info:splitB, timestamp=1266609172177, value=\x00\x071290703\x00\x00\x00\x0
                             1\x26\xE7\xCA,\x7D\x1Btest1,1226169,1266609171581\x00\x071226169\x00\x00\x00\x05\
                             x05test1\x00\x00\x00\x00\x00\x02\x00\x00\x00\x07IS_ROOT\x00\x00\x00\x05false\x00\
                             x00\x00\x07IS_META\x00\x00\x00\x05false\x00\x00\x00\x01\x07\x07actions\x00\x00\x0
                             0\x07\x00\x00\x00\x0BBLOOMFILTER\x00\x00\x00\x05false\x00\x00\x00\x0BCOMPRESSION\
                             x00\x00\x00\x04NONE\x00\x00\x00\x08VERSIONS\x00\x00\x00\x013\x00\x00\x00\x03TTL\x
                             00\x00\x00\x0A2147483647\x00\x00\x00\x09BLOCKSIZE\x00\x00\x00\x0565536\x00\x00\x0
                             0\x09IN_MEMORY\x00\x00\x00\x05false\x00\x00\x00\x0ABLOCKCACHE\x00\x00\x00\x04true
                             \xE1\xDF\xF8p
 test1,1204765,1266609171581 column=info:regioninfo, timestamp=1266609172212, value=REGION
=> {NAME => 'test1,
                             1204765,1266609171581', STARTKEY => '1204765', ENDKEY =>
'1226169', ENCODED => 21
                             34878372, TABLE => {{NAME => 'test1', FAMILIES => [{NAME
=> 'actions', VERSIONS =
                             > '3', COMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE
=> '65536', IN_MEMOR
                             Y => 'false', BLOCKCACHE => 'true'}]}}
{code} 
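
A minimal sketch of the kind of check a meta scanner could run over rows like the ones above. It assumes a hypothetical in-memory model (MetaRow and findSuspectDaughters are illustrative names, not HBase APIs) and flags daughters of split parents that either have no row or have a row without info:server. The sample data is limited to the three rows in the excerpt, which only covers start key 1204765, so a "no row" result here says nothing about what the real .META. contained.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory model of .META. rows; this is not the HBase client API. */
final class MetaSplitCheck {

    /** One .META. row, reduced to the columns that matter for this check. */
    static final class MetaRow {
        final String regionName;
        final boolean split;  // SPLIT => true in info:regioninfo
        final String server;  // info:server, or null if the column is absent
        final String splitA;  // daughter named in info:splitA, or null
        final String splitB;  // daughter named in info:splitB, or null

        MetaRow(String regionName, boolean split, String server, String splitA, String splitB) {
            this.regionName = regionName;
            this.split = split;
            this.server = server;
            this.splitA = splitA;
            this.splitB = splitB;
        }
    }

    /** Lists daughters of split parents that have no row, or a row without info:server. */
    static List<String> findSuspectDaughters(Map<String, MetaRow> meta) {
        List<String> suspects = new ArrayList<>();
        for (MetaRow parent : meta.values()) {
            if (!parent.split) {
                continue; // only split parents reference daughters
            }
            for (String daughter : new String[] {parent.splitA, parent.splitB}) {
                if (daughter == null) {
                    continue; // reference column not present on this parent
                }
                MetaRow row = meta.get(daughter);
                if (row == null) {
                    suspects.add(daughter + " (no row)");
                } else if (row.server == null) {
                    suspects.add(daughter + " (no info:server)");
                }
            }
        }
        return suspects;
    }

    /** The three rows from the excerpt, keyed by region name. */
    static Map<String, MetaRow> excerpt() {
        Map<String, MetaRow> meta = new LinkedHashMap<>();
        meta.put("test1,1204765,1266569946560", new MetaRow("test1,1204765,1266569946560",
                true, "10.129.68.212:60020", null, "test1,1290703,1266581233447"));
        meta.put("test1,1204765,1266581233447", new MetaRow("test1,1204765,1266581233447",
                true, "10.129.68.213:60020",
                "test1,1204765,1266609171581", "test1,1226169,1266609171581"));
        meta.put("test1,1204765,1266609171581", new MetaRow("test1,1204765,1266609171581",
                false, null, null, null));
        return meta;
    }

    public static void main(String[] args) {
        for (String suspect : findSuspectDaughters(excerpt())) {
            System.out.println(suspect);
        }
    }
}
```

A real fix would also have to decide what to do with each suspect (re-insert the daughter row, reassign it, or leave it for add_table); the sketch only detects them.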

> Mechanism that would not have -ROOT- and .META. on same server caused failed assign of .META.
> ---------------------------------------------------------------------------------------------
>
>                 Key: HBASE-2235
>                 URL: https://issues.apache.org/jira/browse/HBASE-2235
>             Project: Hadoop HBase
>          Issue Type: Bug
>            Reporter: stack
>             Fix For: 0.20.4, 0.21.0
>
>
> Here is the short story:
> Scenario is a cluster of 3 servers.  Server 1 crashed.  It was carrying .META.  We split the
> logs.  .META. is put at the head of the assignment queue.  Server 2 happens to be in a state
> where it wants to report a split.  The master fails the report because there is no .META.
> (it fails ugly, with an NPE).  Server 3 checks in and falls into the assignment code
> (RegionManager#regionsAwaitingAssignment).  In here we have this bit of code around line #412:
> {code}
>     if (reassigningMetas && isMetaOrRoot && !isSingleServer) {
>       return regionsToAssign; // dont assign anything to this server.
>     }
> {code}
> Because we think this is not a single-server cluster -- we think there are two 'live' nodes
> -- we won't assign meta.
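
To make the quoted condition concrete, here is a small sketch that isolates it as a pure function. The class and method names are hypothetical, and two things are assumptions rather than facts from the issue: that isSingleServer is derived from the master's count of live servers, and that isMetaOrRoot is true for Server 3 (for instance because it carries -ROOT-).

```java
/** Hypothetical stand-in for the quoted guard; not the actual RegionManager code. */
final class MetaAssignGuard {

    /** Mirrors the quoted condition: true means "assign nothing to this server". */
    static boolean skipsAssignment(boolean reassigningMetas,
                                   boolean isMetaOrRoot,
                                   boolean isSingleServer) {
        return reassigningMetas && isMetaOrRoot && !isSingleServer;
    }

    /** Assumption: the master treats the cluster as single-server when it sees one live server. */
    static boolean isSingleServer(int liveServers) {
        return liveServers == 1;
    }

    public static void main(String[] args) {
        // Scenario from the description: the master still counts two live servers.
        System.out.println(skipsAssignment(true, true, isSingleServer(2)));
        // Same report if the master counted only one live server.
        System.out.println(skipsAssignment(true, true, isSingleServer(1)));
    }
}
```

Under these assumptions the guard skips assignment while the master counts two live servers, and stops skipping once it counts one, which matches the description of why .META. was never handed to Server 3.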

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

