hbase-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "stack (JIRA)" <j...@apache.org>
Subject [jira] Updated: (HBASE-551) Master stuck splitting server logs in shutdown loop; on each iteration, edits are aggregated up into the millions
Date Sat, 29 Mar 2008 22:19:25 GMT

     [ https://issues.apache.org/jira/browse/HBASE-551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

stack updated HBASE-551:
------------------------

    Attachment: 551.patch

M  src/java/org/apache/hadoop/hbase/HLog.java
    (splitLog): If an exception processing a split, catch it.
    In finally, close and delete the split.  Don't try retrying.
    While in some circumstance, we might recover, its also
    likely that we just get same exception again.  If so, and
    multiple files, we'll just accumulate edits until the
    kingdom comes.

> Master stuck splitting server logs in shutdown loop; on each iteration, edits are aggregated
up into the millions
> -----------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-551
>                 URL: https://issues.apache.org/jira/browse/HBASE-551
>             Project: Hadoop HBase
>          Issue Type: Bug
>            Reporter: stack
>            Priority: Blocker
>         Attachments: 551.patch
>
>
> Lars cluster is sick with master trying to split logs.   The logs its replaying have
millions of edits in them.  
> Here is sample from log.   First we get the shutdown and then in the shutdown process,
we start to split up the shutdown servers log:
> {code}
> 2008-03-28 16:29:45,305 INFO org.apache.hadoop.hbase.HMaster: process shutdown of server
192.168.105.37:60020: logSplit: false, rootRes
> canned: false, numberOfMetaRegions: 1, onlineMetaRegions.size(): 1
> 2008-03-28 16:29:45,310 INFO org.apache.hadoop.hbase.HLog: splitting 3 log(s) in hdfs://lv1-xen-pdc-2.worldlingo.com:9000/hbase/log_192
> .168.105.37_1206741382563_60020
> 2008-03-28 16:29:45,311 DEBUG org.apache.hadoop.hbase.HLog: Splitting 0 of 3: hdfs://lv1-xen-pdc-2.worldlingo.com:9000/hbase/log_192.16
> 8.105.37_1206741382563_60020/hlog.dat.002
> 2008-03-28 16:29:45,380 DEBUG org.apache.hadoop.hbase.HLog: Creating new log file writer
for path hdfs://lv1-xen-pdc-2.worldlingo.com:9
> 000/hbase/pdc-docs/488338803/oldlogfile.log and region pdc-docs,EP01108687NWA2,1205739919655
> 2008-03-28 16:29:45,390 DEBUG org.apache.hadoop.hbase.HLog: Creating new log file writer
for path hdfs://lv1-xen-pdc-2.worldlingo.com:9
> 000/hbase/pdc-docs/447465883/oldlogfile.log and region pdc-docs,EP01900680NWA1,1205754584444
> 2008-03-28 16:29:45,403 DEBUG org.apache.hadoop.hbase.HLog: Creating new log file writer
for path hdfs://lv1-xen-pdc-2.worldlingo.com:9
> 000/hbase/pdc-docs/2035706226/oldlogfile.log and region pdc-docs,EP01119588NWA2,1205754281917
> 2008-03-28 16:29:45,428 DEBUG org.apache.hadoop.hbase.HLog: Creating new log file writer
for path hdfs://lv1-xen-pdc-2.worldlingo.com:9
> 000/hbase/pdc-docs/437772136/oldlogfile.log and region pdc-docs,EP00200190NWA2,120576451593
> ...
> {code}
> We open a file in each region to take edits.  We then start replaying the 3 WAL files
from the regionserver.
> On the second one, we get exception... 
> {code}
> 2008-03-28 16:40:36,537 WARN org.apache.hadoop.hbase.HLog: Old log file hdfs://lv1-xen-pdc-2.worldlingo.com:9000/hbase/pdc-docs/1045858
> 46/oldlogfile.log already exists. Copying existing file to new file
> 2008-03-28 16:40:36,545 DEBUG org.apache.hadoop.hbase.HLog: Creating new log file writer
for path hdfs://lv1-xen-pdc-2.worldlingo.com:9
> 000/hbase/pdc-docs/104585846/oldlogfile.log and region pdc-docs,EP96104830NWA1,1205768785572
> 2008-03-28 16:40:36,979 DEBUG org.apache.hadoop.hbase.HLog: Copied 220000 edits
> 2008-03-28 16:40:38,853 DEBUG org.apache.hadoop.hbase.HLog: Applied 222812 total edits
> 2008-03-28 16:40:38,853 DEBUG org.apache.hadoop.hbase.HLog: Splitting 1 of 3: hdfs://lv1-xen-pdc-2.worldlingo.com:9000/hbase/log_192.16
> 8.105.37_1206741382563_60020/hlog.dat.003
> 2008-03-28 16:40:56,883 WARN org.apache.hadoop.hbase.HLog: Old log file hdfs://lv1-xen-pdc-2.worldlingo.com:9000/hbase/pdc-docs/2118067
> 194/oldlogfile.log already exists. Copying existing file to new file
> 2008-03-28 16:40:56,891 DEBUG org.apache.hadoop.hbase.HLog: Creating new log file writer
for path hdfs://lv1-xen-pdc-2.worldlingo.com:9
> 000/hbase/pdc-docs/2118067194/oldlogfile.log and region pdc-docs,EP97302517NWA2,1205726201776
> 2008-03-28 16:41:12,910 DEBUG org.apache.hadoop.hbase.HLog: Applied 36638 total edits
> 2008-03-28 16:41:12,910 DEBUG org.apache.hadoop.hbase.HLog: Splitting 2 of 3: hdfs://lv1-xen-pdc-2.worldlingo.com:9000/hbase/log_192.16
> 8.105.37_1206741382563_60020/hlog.dat.004
> 2008-03-28 16:41:18,684 WARN org.apache.hadoop.hbase.HMaster: Processing pending operations:
ProcessServerShutdown of 192.168.105.37:60
> 020
> java.io.EOFException
>         at java.io.DataInputStream.readFully(DataInputStream.java:178)
>         at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:56)
>         at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:90)
>         at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1829)
>         at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1729)
>         at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1775)
>         at org.apache.hadoop.hbase.HLog.splitLog(HLog.java:540)
>         at org.apache.hadoop.hbase.HMaster$ProcessServerShutdown.process(HMaster.java:2167)
>         at org.apache.hadoop.hbase.HMaster.run(HMaster.java:1085)
> {code}
> A finally clause makes sure we close up all the new files we've made in all regions.
 These new files have accumulated some edits from the splitting of the first file.
> Because we got an exception, the shutdown processing runs again.
> Because regions have files in place with edits, we won't overwrite them second time through.
 We instead copy the old into a new file to which we start appending until the exception happens
again.
> After a couple of hours, we're up into the millions of edits.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Mime
View raw message