Mailing-List: contact hbase-dev-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: hbase-dev@hadoop.apache.org
Message-ID: <1726849278.1206996864975.JavaMail.jira@brutus>
Date: Mon, 31 Mar 2008 13:54:24 -0700 (PDT)
From: "stack (JIRA)" <jira@apache.org>
To: hbase-dev@hadoop.apache.org
Subject: [jira] Updated: (HBASE-551) Master stuck splitting server logs in
 shutdown loop; on each iteration, edits are aggregated up into the millions
In-Reply-To: <930935855.1206828084716.JavaMail.jira@brutus>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


     [ https://issues.apache.org/jira/browse/HBASE-551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-551:
------------------------

    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

Applied to branch and TRUNK.

> Master stuck splitting server logs in shutdown loop; on each iteration, edits are aggregated up into the millions
> -----------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-551
>                 URL: https://issues.apache.org/jira/browse/HBASE-551
>             Project: Hadoop HBase
>          Issue Type: Bug
>    Affects Versions: 0.2.0, 0.1.1, 0.1.0
>            Reporter: stack
>            Assignee: stack
>            Priority: Blocker
>             Fix For: 0.2.0, 0.1.1
>
>         Attachments: 551.patch
>
>
> Lars' cluster is sick, with the master stuck trying to split logs. The logs it's replaying have millions of edits in them.
> Here is a sample from the log. First we get the shutdown, and then, during shutdown processing, we start to split the shut-down server's logs:
> {code}
> 2008-03-28 16:29:45,305 INFO org.apache.hadoop.hbase.HMaster: process shutdown of server 192.168.105.37:60020: logSplit: false, rootRes
> canned: false, numberOfMetaRegions: 1, onlineMetaRegions.size(): 1
> 2008-03-28 16:29:45,310 INFO org.apache.hadoop.hbase.HLog: splitting 3 log(s) in hdfs://lv1-xen-pdc-2.worldlingo.com:9000/hbase/log_192
> .168.105.37_1206741382563_60020
> 2008-03-28 16:29:45,311 DEBUG org.apache.hadoop.hbase.HLog: Splitting 0 of 3: hdfs://lv1-xen-pdc-2.worldlingo.com:9000/hbase/log_192.16
> 8.105.37_1206741382563_60020/hlog.dat.002
> 2008-03-28 16:29:45,380 DEBUG org.apache.hadoop.hbase.HLog: Creating new log file writer for path hdfs://lv1-xen-pdc-2.worldlingo.com:9
> 000/hbase/pdc-docs/488338803/oldlogfile.log and region pdc-docs,EP01108687NWA2,1205739919655
> 2008-03-28 16:29:45,390 DEBUG org.apache.hadoop.hbase.HLog: Creating new log file writer for path hdfs://lv1-xen-pdc-2.worldlingo.com:9
> 000/hbase/pdc-docs/447465883/oldlogfile.log and region pdc-docs,EP01900680NWA1,1205754584444
> 2008-03-28 16:29:45,403 DEBUG org.apache.hadoop.hbase.HLog: Creating new log file writer for path hdfs://lv1-xen-pdc-2.worldlingo.com:9
> 000/hbase/pdc-docs/2035706226/oldlogfile.log and region pdc-docs,EP01119588NWA2,1205754281917
> 2008-03-28 16:29:45,428 DEBUG org.apache.hadoop.hbase.HLog: Creating new log file writer for path hdfs://lv1-xen-pdc-2.worldlingo.com:9
> 000/hbase/pdc-docs/437772136/oldlogfile.log and region pdc-docs,EP00200190NWA2,120576451593
> ...
> {code}
> We open a file in each region to take edits. We then start replaying the 3 WAL files from the regionserver.
> On the second one, we get an exception:
> {code}
> 2008-03-28 16:40:36,537 WARN org.apache.hadoop.hbase.HLog: Old log file hdfs://lv1-xen-pdc-2.worldlingo.com:9000/hbase/pdc-docs/1045858
> 46/oldlogfile.log already exists. Copying existing file to new file
> 2008-03-28 16:40:36,545 DEBUG org.apache.hadoop.hbase.HLog: Creating new log file writer for path hdfs://lv1-xen-pdc-2.worldlingo.com:9
> 000/hbase/pdc-docs/104585846/oldlogfile.log and region pdc-docs,EP96104830NWA1,1205768785572
> 2008-03-28 16:40:36,979 DEBUG org.apache.hadoop.hbase.HLog: Copied 220000 edits
> 2008-03-28 16:40:38,853 DEBUG org.apache.hadoop.hbase.HLog: Applied 222812 total edits
> 2008-03-28 16:40:38,853 DEBUG org.apache.hadoop.hbase.HLog: Splitting 1 of 3: hdfs://lv1-xen-pdc-2.worldlingo.com:9000/hbase/log_192.16
> 8.105.37_1206741382563_60020/hlog.dat.003
> 2008-03-28 16:40:56,883 WARN org.apache.hadoop.hbase.HLog: Old log file hdfs://lv1-xen-pdc-2.worldlingo.com:9000/hbase/pdc-docs/2118067
> 194/oldlogfile.log already exists. Copying existing file to new file
> 2008-03-28 16:40:56,891 DEBUG org.apache.hadoop.hbase.HLog: Creating new log file writer for path hdfs://lv1-xen-pdc-2.worldlingo.com:9
> 000/hbase/pdc-docs/2118067194/oldlogfile.log and region pdc-docs,EP97302517NWA2,1205726201776
> 2008-03-28 16:41:12,910 DEBUG org.apache.hadoop.hbase.HLog: Applied 36638 total edits
> 2008-03-28 16:41:12,910 DEBUG org.apache.hadoop.hbase.HLog: Splitting 2 of 3: hdfs://lv1-xen-pdc-2.worldlingo.com:9000/hbase/log_192.16
> 8.105.37_1206741382563_60020/hlog.dat.004
> 2008-03-28 16:41:18,684 WARN org.apache.hadoop.hbase.HMaster: Processing pending operations: ProcessServerShutdown of 192.168.105.37:60
> 020
> java.io.EOFException
>         at java.io.DataInputStream.readFully(DataInputStream.java:178)
>         at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:56)
>         at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:90)
>         at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1829)
>         at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1729)
>         at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1775)
>         at org.apache.hadoop.hbase.HLog.splitLog(HLog.java:540)
>         at org.apache.hadoop.hbase.HMaster$ProcessServerShutdown.process(HMaster.java:2167)
>         at org.apache.hadoop.hbase.HMaster.run(HMaster.java:1085)
> {code}
> A finally clause makes sure we close up all the new files we've made in all regions.  These new files have accumulated some edits from the splitting of the first file.
> Because we got an exception, the shutdown processing runs again.
> Because the regions already have files in place with edits, we won't overwrite them the second time through. Instead, we copy the old file into a new file, to which we keep appending until the exception happens again.
> After a couple of hours, we're up into the millions of edits.
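
The loop described above can be sketched as a toy model. This is not HBase code: the class SplitLoopSketch, its methods, the region name, and the figure of 250,000 edits per pass are all hypothetical, chosen only to show why the per-region file keeps growing. Each pass copies what is already in the region's old log file, appends the edits replayed before the failure, and persists the result in a finally block even though the split aborts.

```java
import java.io.EOFException;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Toy model of the retry loop from the issue description.
// All names and numbers here are hypothetical; none of this is HBase code.
public class SplitLoopSketch {
    // Stands in for each region's oldlogfile.log: region name -> edits stored.
    static final Map<String, Long> oldLogFiles = new HashMap<>();

    // One shutdown-processing pass over a single region.
    public static void splitLog(String region, long editsBeforeFailure) throws IOException {
        // "Old log file ... already exists. Copying existing file to new file"
        long writer = oldLogFiles.getOrDefault(region, 0L);
        try {
            // Edits replayed from the logs that can still be read.
            writer += editsBeforeFailure;
            // The unreadable log aborts the split.
            throw new EOFException("truncated log");
        } finally {
            // The finally clause closes the file, so the edits written so far persist.
            oldLogFiles.put(region, writer);
        }
    }

    // Runs the split the given number of times and returns the stored edit count.
    public static long simulate(int passes, long editsBeforeFailure) {
        oldLogFiles.clear();
        for (int i = 0; i < passes; i++) {
            try {
                splitLog("region-a", editsBeforeFailure);
            } catch (IOException e) {
                // The failed shutdown operation is simply retried on the next pass.
            }
        }
        return oldLogFiles.getOrDefault("region-a", 0L);
    }

    public static void main(String[] args) {
        System.out.println(simulate(10, 250_000L)); // prints 2500000
    }
}
```

In this model the stored count grows by the same amount on every retry, so ten failed passes at 250,000 edits each already leave 2.5 million edits in the file. Any fix has to break that cycle, for example by not carrying the old file forward on a retry or by not retrying after an unreadable log; what 551.patch actually does is not stated in this message.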

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.