hbase-issues mailing list archives

From "Elliott Clark (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-6461) Killing the HRegionServer and DataNode hosting ROOT can result in a malformed root table.
Date Thu, 26 Jul 2012 23:07:35 GMT

    [ https://issues.apache.org/jira/browse/HBASE-6461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13423535#comment-13423535 ]

Elliott Clark commented on HBASE-6461:
--------------------------------------

bq. Any chance to repeat this with distributed log splitting disabled?
Sure, give me a little bit. I'm still reading the logs on my latest example. Once JD and I are done (assuming we don't solve it) I'll try with distributed log splitting off.
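
For reference, assuming the switch is still the hbase.master.distributed.log.splitting property from 0.92 (default true), turning it off would be a master-side hbase-site.xml change like the one below; I haven't verified the property name against this cluster's configs.
{noformat}
<!-- hbase-site.xml on the master (assumed property name; default is true) -->
<property>
  <name>hbase.master.distributed.log.splitting</name>
  <value>false</value>
</property>
{noformat}
The master would need a restart to pick the change up.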


On the latest run JD noticed this in the logs of the server that was doing the splitting (there was only one hlog; nothing had rolled yet). 10.4.3.44 is the server that had all of its java processes killed.
{noformat}
453-2012-07-26 22:13:18,079 WARN org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: File hdfs://sv4r11s38:9901/hbase/.logs/sv4r3s44,9913,1343340207863-splitting/sv4r3s44%2C9913%2C1343340207863.1343340210132 might be still open, length is 0
454-2012-07-26 22:13:18,081 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: Recovering file hdfs://sv4r11s38:9901/hbase/.logs/sv4r3s44,9913,1343340207863-splitting/sv4r3s44%2C9913%2C1343340207863.1343340210132
455-2012-07-26 22:13:19,083 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: Finished lease recover attempt for hdfs://sv4r11s38:9901/hbase/.logs/sv4r3s44,9913,1343340207863-splitting/sv4r3s44%2C9913%2C1343340207863.1343340210132
456-2012-07-26 22:13:20,091 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.4.3.44:9910. Already tried 0 time(s).
457-2012-07-26 22:13:21,092 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.4.3.44:9910. Already tried 1 time(s).
458-2012-07-26 22:13:22,093 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.4.3.44:9910. Already tried 2 time(s).
459-2012-07-26 22:13:23,093 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.4.3.44:9910. Already tried 3 time(s).
460-2012-07-26 22:13:24,094 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.4.3.44:9910. Already tried 4 time(s).
461-2012-07-26 22:13:25,095 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.4.3.44:9910. Already tried 5 time(s).
462-2012-07-26 22:13:26,096 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.4.3.44:9910. Already tried 6 time(s).
463-2012-07-26 22:13:27,096 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.4.3.44:9910. Already tried 7 time(s).
464-2012-07-26 22:13:27,936 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: LRU Stats: total=50.28 MB, free=5.94 GB, max=5.99 GB, blocks=0, accesses=0, hits=0, hitRatio=0cachingAccesses=0, cachingHits=0, cachingHitsRatio=0evictions=0, evicted=0, evictedPerRun=NaN
465-2012-07-26 22:13:28,097 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.4.3.44:9910. Already tried 8 time(s).
466-2012-07-26 22:13:29,098 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.4.3.44:9910. Already tried 9 time(s).
467:2012-07-26 22:13:29,100 WARN org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Could not open hdfs://sv4r11s38:9901/hbase/.logs/sv4r3s44,9913,1343340207863-splitting/sv4r3s44%2C9913%2C1343340207863.1343340210132 for reading. File is empty
468-java.io.EOFException
469-	at java.io.DataInputStream.readFully(DataInputStream.java:180)
470-	at java.io.DataInputStream.readFully(DataInputStream.java:152)
471-	at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1521)
472-	at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1493)
473-	at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1480)
474-	at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1475)
475-	at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.<init>(SequenceFileLogReader.java:55)
476-	at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.init(SequenceFileLogReader.java:175)
477-	at org.apache.hadoop.hbase.regionserver.wal.HLog.getReader(HLog.java:716)
478-	at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.getReader(HLogSplitter.java:821)
479-	at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.getReader(HLogSplitter.java:734)
480-	at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFile(HLogSplitter.java:381)
481-	at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFile(HLogSplitter.java:348)
482-	at org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:111)
483-	at org.apache.hadoop.hbase.regionserver.SplitLogWorker.grabTask(SplitLogWorker.java:264)
484-	at org.apache.hadoop.hbase.regionserver.SplitLogWorker.taskLoop(SplitLogWorker.java:195)
485-	at org.apache.hadoop.hbase.regionserver.SplitLogWorker.run(SplitLogWorker.java:163)
486-	at java.lang.Thread.run(Thread.java:662)
{noformat}

So the log splitter keeps retrying the dead server, then fails to read anything out of the hlog, and root ends up empty.
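
To make that concrete, here's a rough standalone sketch of what the splitter's read path boils down to once the DataNode is gone. This is not the actual HLogSplitter code and the path argument is a placeholder; it just reproduces the length-0 / EOFException behaviour from the log above using plain Hadoop APIs.
{code:java}
// Minimal sketch (NOT the actual HLogSplitter code): reproduce the
// "length is 0" warning and the EOFException against an hlog whose last
// block only ever lived on the now-dead DataNode.
import java.io.EOFException;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;

public class EmptyHLogRepro {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    // Placeholder for the -splitting hlog path from the log above.
    Path hlog = new Path(args[0]);
    FileSystem fs = hlog.getFileSystem(conf);

    // The NameNode does not count the unclosed last block, hence the
    // "might be still open, length is 0" warning.
    long len = fs.getFileStatus(hlog).getLen();
    System.out.println("Reported length: " + len);

    // HLogSplitter attempts lease recovery at this point (the FSHDFSUtils
    // messages above); with the DataNode dead, recovery cannot make the
    // last block's contents readable.

    try {
      // Opening the reader re-reads the SequenceFile header; on a 0-length
      // file readFully() hits EOF, exactly as in lines 467/468 of the log.
      SequenceFile.Reader reader = new SequenceFile.Reader(fs, hlog, conf);
      reader.close();
    } catch (EOFException e) {
      // The splitter treats this as "File is empty" and moves on, so edits
      // that only existed on the dead node are never replayed.
      System.out.println("EOF reading header; splitter would treat hlog as empty");
    }
  }
}
{code}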
                
> Killing the HRegionServer and DataNode hosting ROOT can result in a malformed root table.
> -----------------------------------------------------------------------------------------
>
>                 Key: HBASE-6461
>                 URL: https://issues.apache.org/jira/browse/HBASE-6461
>             Project: HBase
>          Issue Type: Bug
>         Environment: hadoop-0.20.2-cdh3u3
> HBase 0.94.1 RC1
>            Reporter: Elliott Clark
>            Priority: Critical
>             Fix For: 0.94.2
>
>
> Spun up a new dfs on hadoop-0.20.2-cdh3u3
> Started hbase
> started running loadtest tool.
> killed rs and dn holding root with killall -9 java on server sv4r27s44 at about 2012-07-25 22:40:00
> After things stabilize Root is in a bad state. Ran hbck and got:
> Exception in thread "main" org.apache.hadoop.hbase.client.NoServerForRegionException: No server address listed in -ROOT- for region .META.,,1.1028785192 containing row
> at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:1016)
> at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:841)
> at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:810)
> at org.apache.hadoop.hbase.client.HTable.finishSetup(HTable.java:232)
> at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:172)
> at org.apache.hadoop.hbase.util.HBaseFsck.connect(HBaseFsck.java:241)
> at org.apache.hadoop.hbase.util.HBaseFsck.main(HBaseFsck.java:3236)
> hbase(main):001:0> scan '-ROOT-'
> ROW                                           COLUMN+CELL
> 12/07/25 22:43:18 INFO security.UserGroupInformation: JAAS Configuration already set up for Hadoop, not re-installing.
>  .META.,,1                                    column=info:regioninfo, timestamp=1343255838525, value={NAME => '.META.,,1', STARTKEY => '', ENDKEY => '', ENCODED => 1028785192,}
>  .META.,,1                                    column=info:v, timestamp=1343255838525, value=\x00\x00
> 1 row(s) in 0.5930 seconds
> Here's the master log: https://gist.github.com/3179194
> I tried the same thing with 0.92.1 and I was able to get into a similar situation, so I don't think this is anything new.
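
A side note on the scan output in the description: the .META.,,1 row only shows info:regioninfo and info:v, and it's the missing info:server / info:serverstartcode cells that the NoServerForRegionException is complaining about. A quick way to check that row directly on a broken cluster (hypothetical shell session, not captured from this run):
{noformat}
hbase(main):001:0> get '-ROOT-', '.META.,,1'
{noformat}
On a healthy cluster that get should also return info:server and info:serverstartcode pointing at the regionserver currently hosting .META.; here those cells are absent after the failed split.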

        
