hbase-commits mailing list archives

From st...@apache.org
Subject git commit: HBASE-11280 Document distributed log replay and distributed log splitting (Misty Stanley-Jones)
Date Tue, 17 Jun 2014 23:18:20 GMT
Repository: hbase
Updated Branches:
  refs/heads/master 1324a3cb1 -> 1b9242259

HBASE-11280 Document distributed log replay and distributed log splitting (Misty Stanley-Jones)

Project: http://git-wip-us.apache.org/repos/asf/hbase/repo
Commit: http://git-wip-us.apache.org/repos/asf/hbase/commit/1b924225
Tree: http://git-wip-us.apache.org/repos/asf/hbase/tree/1b924225
Diff: http://git-wip-us.apache.org/repos/asf/hbase/diff/1b924225

Branch: refs/heads/master
Commit: 1b9242259704d5c6a653a7f56cb1121b981ae3da
Parents: 1324a3c
Author: Michael Stack <stack@apache.org>
Authored: Tue Jun 17 18:18:00 2014 -0500
Committer: Michael Stack <stack@apache.org>
Committed: Tue Jun 17 18:18:00 2014 -0500

 src/main/docbkx/book.xml | 384 ++++++++++++++++++++++++++++++++++++++----
 1 file changed, 351 insertions(+), 33 deletions(-)

diff --git a/src/main/docbkx/book.xml b/src/main/docbkx/book.xml
index 3e22dfd..1af243c 100644
--- a/src/main/docbkx/book.xml
+++ b/src/main/docbkx/book.xml
@@ -2200,16 +2200,21 @@ rs.close();
-          <para>Each RegionServer adds updates (Puts, Deletes) to its write-ahead log (WAL) first,
-            and then to the <xref
-              linkend="store.memstore" /> for the affected <xref
-              linkend="store" />. This ensures that HBase has durable writes. Without WAL, there is
-            the possibility of data loss in the case of a RegionServer failure before each
-            MemStore is flushed and new StoreFiles are written. <link
+          <para>The <firstterm>Write Ahead Log (WAL)</firstterm> records all changes to data in
+            HBase, to file-based storage. Under normal operations, the WAL is not needed because
+            data changes move from the MemStore to StoreFiles. However, if a RegionServer crashes or
+            becomes unavailable before the MemStore is flushed, the WAL ensures that the changes to
+            the data can be replayed. If writing to the WAL fails, the entire operation to modify the
+            data fails.</para>
+          <para>HBase uses an implementation of the <link
-            is the HBase WAL implementation, and there is one HLog instance per RegionServer.
-          <para>The WAL is in HDFS in <filename>/hbase/.logs/</filename> with subdirectories per
+            interface for the WAL. Usually, there is only one instance of a WAL per RegionServer.
+            The RegionServer records Puts and Deletes to it, before recording them to the <xref
+              linkend="store.memstore" /> for the affected <xref
+              linkend="store" />.</para>
+          <para>The WAL resides in HDFS in the <filename>/hbase/WALs/</filename> directory (prior to
+            HBase 0.94, it was stored in <filename>/hbase/.logs/</filename>), with subdirectories per
           <para> For more general information about the concept of write ahead logs, see the
             Wikipedia <link
@@ -2226,39 +2231,352 @@ rs.close();
           <title>WAL Splitting</title>
+          <para>A RegionServer serves many regions. All of the regions in a region server share the
+            same active WAL file. Each edit in the WAL file includes information about which region
+            it belongs to. When a region is opened, the edits in the WAL file which belong to that
+            region need to be replayed. Therefore, edits in the WAL file must be grouped by region
+            so that particular sets can be replayed to regenerate the data in a particular region.
+            The process of grouping the WAL edits by region is called <firstterm>log
+              splitting</firstterm>. It is a critical process for recovering data if a region server
+            fails.</para>
+          <para>Log splitting is done by the HMaster during cluster start-up or by the
+            ServerShutdownHandler as a region server shuts down. So that consistency is guaranteed,
+            affected regions are unavailable until data is restored. All WAL edits need to be
+            recovered and replayed before a given region can become available again. As a result,
+            regions affected by log splitting are unavailable until the process completes.</para>
+          <procedure xml:id="log.splitting.step.by.step">
+            <title>Log Splitting, Step by Step</title>
+            <step>
+              <title>The <filename>/hbase/WALs/&lt;host>,&lt;port>,&lt;startcode></filename>
+                directory is renamed.</title>
+              <para>Renaming the directory is important because a RegionServer may still be up and
+                accepting requests even if the HMaster thinks it is down. If the RegionServer does
+                not respond immediately and does not heartbeat its ZooKeeper session, the HMaster
+                may interpret this as a RegionServer failure. Renaming the logs directory ensures
+                that existing, valid WAL files which are still in use by an active but busy
+                RegionServer are not written to by accident.</para>
+              <para>The new directory is named according to the following pattern:</para>
+              <screen><![CDATA[/hbase/WALs/<host>,<port>,<startcode>-splitting]]></screen>
+              <para>An example of such a renamed directory might look like the following:</para>
+              <screen>/hbase/WALs/srv.example.com,60020,1254173957298-splitting</screen>
+            </step>
+            <step>
+              <title>Each log file is split, one at a time.</title>
+              <para>The log splitter reads the log file one edit entry at a time and puts each edit
+                entry into the buffer corresponding to the edit’s region. At the same time, the
+                splitter starts several writer threads. Writer threads pick up a corresponding
+                buffer and write the edit entries in the buffer to a temporary recovered edit
+                file. The temporary edit file is stored to disk with the following naming pattern:</para>
+              <screen><![CDATA[/hbase/<table_name>/<region_id>/recovered.edits/.temp]]></screen>
+              <para>This file is used to store all the edits in the WAL log for this region. After
+                log splitting completes, the <filename>.temp</filename> file is renamed to the
+                sequence ID of the first log written to the file.</para>
+              <para>To determine whether all edits have been written, the sequence ID is compared to
+                the sequence of the last edit that was written to the HFile. If the sequence of the
+                last edit is greater than or equal to the sequence ID included in the file name, it
+                is clear that all writes from the edit file have been completed.</para>
+            </step>
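The sequence-ID comparison in the step above boils down to a single check. A minimal Python sketch (the function name and argument names are invented for illustration; this is not HBase code):

```python
def all_edits_persisted(last_flushed_seq_id: int, file_seq_id: int) -> bool:
    """Return True if the HFile already contains every edit in a
    recovered-edits file named after the sequence ID of its first edit."""
    return last_flushed_seq_id >= file_seq_id

# Edits up to sequence 2050 were already flushed; the file starts at 1800.
print(all_edits_persisted(2050, 1800))  # True: replay is unnecessary
print(all_edits_persisted(1500, 1800))  # False: the file holds unflushed edits
```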
+            <step>
+              <title>After log splitting is complete, each affected region is assigned to a
+                RegionServer.</title>
+              <para> When the region is opened, the <filename>recovered.edits</filename> folder is
+                checked for recovered edits files. If any such files are present, they are replayed
+                by reading the edits and saving them to the MemStore. After all edit files are
+                replayed, the contents of the MemStore are written to disk (HFile) and the edit
+                files are deleted.</para>
+            </step>
+          </procedure>
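The grouping phase described in the procedure above can be sketched as a toy Python illustration (entry shapes, region names, and the function name are made up; real log splitting also writes per-region buffers to recovered-edits files via writer threads):

```python
from collections import defaultdict

def split_wal(entries):
    """Group WAL edit entries into per-region buffers: the core grouping
    step of log splitting, so each region's edits can be replayed alone."""
    buffers = defaultdict(list)
    for region, edit in entries:
        buffers[region].append(edit)
    return dict(buffers)

wal = [("region-A", "put:k1"), ("region-B", "put:k9"), ("region-A", "del:k1")]
print(split_wal(wal))
# {'region-A': ['put:k1', 'del:k1'], 'region-B': ['put:k9']}
```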
-            <title>How edits are recovered from a crashed RegionServer</title>
-            <para>When a RegionServer crashes, it will lose its ephemeral lease in
-              ZooKeeper...TODO</para>
-          </section>
-          <section>
-            <title><varname>hbase.hlog.split.skip.errors</varname></title>
+            <title>Handling of Errors During Log Splitting</title>
-            <para>When set to <constant>true</constant>, any error encountered splitting will be
-              logged, the problematic WAL will be moved into the <filename>.corrupt</filename>
-              directory under the hbase <varname>rootdir</varname>, and processing will continue. If
-              set to <constant>false</constant>, the default, the exception will be propagated and
-              the split logged as failed.<footnote>
+            <para>If you set the <varname>hbase.hlog.split.skip.errors</varname> option to
+                <constant>true</constant>, errors are treated as follows:</para>
+            <itemizedlist>
+              <listitem>
+                <para>Any error encountered during splitting will be logged.</para>
+              </listitem>
+              <listitem>
+                <para>The problematic WAL log will be moved into the <filename>.corrupt</filename>
+                  directory under the hbase <varname>rootdir</varname>.</para>
+              </listitem>
+              <listitem>
+                <para>Processing of the WAL will continue.</para>
+              </listitem>
+            </itemizedlist>
+            <para>If the <varname>hbase.hlog.split.skip.errors</varname> option is set to
+                <literal>false</literal>, the default, the exception will be propagated and the
+              split will be logged as failed.<footnote>
                 <para>See <link
                     hbase.hlog.split.skip.errors is set to false, we fail the split but thats
                     it</link>. We need to do more than just fail split if this flag is set.</para>
+            <section>
+              <title>How EOFExceptions are treated when splitting a crashed RegionServer's
+                WALs</title>
+              <para>If an EOFException occurs while splitting logs, the split proceeds even when
+                  <varname>hbase.hlog.split.skip.errors</varname> is set to
+                <literal>false</literal>. An EOFException while reading the last log in the set of
+                files to split is likely, because the RegionServer is likely to be in the process of
+                writing a record at the time of a crash. <footnote>
+                  <para>For background, see <link
+                      xlink:href="https://issues.apache.org/jira/browse/HBASE-2643">HBASE-2643
+                      Figure how to deal with eof splitting logs</link></para>
+                </footnote></para>
+            </section>
+            </section>
-            <title>How EOFExceptions are treated when splitting a crashed RegionServers'
-              WALs</title>
-            <para>If we get an EOF while splitting logs, we proceed with the split even when
-                <varname>hbase.hlog.split.skip.errors</varname> == <constant>false</constant>.
-              EOF while reading the last log in the set of files to split is near-guaranteed since
-              the RegionServer likely crashed mid-write of a record. But we'll continue even if we
-              got an EOF reading other than the last file in the set.<footnote>
-                <para>For background, see <link
-                    xlink:href="https://issues.apache.org/jira/browse/HBASE-2643">HBASE-2643 Figure
-                    how to deal with eof splitting logs</link></para>
-              </footnote></para>
+            <title>Performance Improvements during Log Splitting</title>
+            <para>
+              WAL log splitting and recovery can be resource intensive and take a long time,
+              depending on the number of RegionServers involved in the crash and the size
of the
+              regions. <xref linkend="distributed.log.splitting" /> and <xref
+                linkend="distributed.log.replay" /> were developed to improve
+              performance during log splitting.
+            </para>
+            <section xml:id="distributed.log.splitting">
+              <title>Distributed Log Splitting</title>
+              <para><firstterm>Distributed Log Splitting</firstterm> was added in HBase version 0.92
+                (<link xlink:href="https://issues.apache.org/jira/browse/HBASE-1364">HBASE-1364</link>)
+                by Prakash Khemani from Facebook. It reduces the time to complete log splitting
+                dramatically, improving the availability of regions and tables. For example,
+                recovering a crashed cluster took around 9 hours with single-threaded log splitting,
+                but only about six minutes with distributed log splitting.</para>
+              <para>The information in this section is sourced from Jimmy Xiang's blog
post at <link
+              xlink:href="http://blog.cloudera.com/blog/2012/07/hbase-log-splitting/" />.</para>
+              <formalpara>
+                <title>Enabling or Disabling Distributed Log Splitting</title>
+                <para>Distributed log processing is enabled by default since HBase
0.92. The setting
+                  is controlled by the <property>hbase.master.distributed.log.splitting</property>
+                  property, which can be set to <literal>true</literal> or <literal>false</literal>,
+                  but defaults to <literal>true</literal>. </para>
+              </formalpara>
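As an hbase-site.xml sketch (the property name comes from the paragraph above; placing it in hbase-site.xml follows the usual HBase configuration convention and is not part of this commit):

```xml
<!-- hbase-site.xml (sketch): distributed log splitting defaults to true
     since HBase 0.92; set the property explicitly only to override it. -->
<property>
  <name>hbase.master.distributed.log.splitting</name>
  <value>true</value>
</property>
```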
+              <procedure>
+                <title>Distributed Log Splitting, Step by Step</title>
+                <para>After configuring distributed log splitting, the HMaster controls the process.
+                  The HMaster enrolls each RegionServer in the log splitting process, and the actual
+                  work of splitting the logs is done by the RegionServers. The general process of
+                  log splitting, as described in <xref
+                    linkend="log.splitting.step.by.step" />, still applies here.</para>
+                <step>
+                  <para>If distributed log processing is enabled, the HMaster creates a
+                    <firstterm>split log manager</firstterm> instance when the cluster is started.
+                    The split log manager manages all log files which need to be scanned and split.
+                    The split log manager places all the logs into the ZooKeeper splitlog node
+                    (<filename>/hbase/splitlog</filename>) as tasks. You can view the contents of
+                    the splitlog by issuing the following <command>zkcli</command> command. Example
+                    output is shown.</para>
+                  <screen>ls /hbase/splitlog
+                  </screen>
+                  <para>The output contains some non-ASCII characters. When decoded, it looks much
+                    simpler:</para>
+                  <screen>
+                  </screen>
+                  <para>The listing represents WAL file names to be scanned and split,
which is a
+                    list of log splitting tasks.</para>
+                </step>
+                <step>
+                  <title>The split log manager monitors the log-splitting tasks and workers.</title>
+                  <para>The split log manager is responsible for the following ongoing
+                    tasks:</para>
+                  <itemizedlist>
+                    <listitem>
+                      <para>Once the split log manager publishes all the tasks to the
+                        znode, it monitors these task nodes and waits for them to be
+                        processed.</para>
+                    </listitem>
+                    <listitem>
+                      <para>Checks to see if there are any dead split log workers queued up. If it
+                        finds tasks claimed by unresponsive workers, it will resubmit those tasks.
+                        If the resubmit fails due to some ZooKeeper exception, the dead worker is
+                        queued up again for retry.</para>
+                    </listitem>
+                    <listitem>
+                      <para>Checks to see if there are any unassigned tasks. If it finds any, it
+                        creates an ephemeral rescan node so that each split log worker is notified
+                        to re-scan unassigned tasks via the <code>nodeChildrenChanged</code>
+                        ZooKeeper event.</para>
+                    </listitem>
+                    <listitem>
+                      <para>Checks for tasks which are assigned but expired. If any are found, they
+                        are moved back to <code>TASK_UNASSIGNED</code> state again so that they can
+                        be retried. It is possible that these tasks are assigned to slow workers, or
+                        they may already be finished. This is not a problem, because log splitting
+                        tasks have the property of idempotence. In other words, the same log
+                        splitting task can be processed many times without causing any
+                        problem.</para>
+                    </listitem>
+                    <listitem>
+                      <para>The split log manager watches the HBase split log znodes constantly. If
+                        any split log task node data is changed, the split log manager retrieves the
+                        node data. The node data contains the current state of the task. You can use
+                        the <command>zkcli</command> <command>get</command> command to retrieve the
+                        current state of a task. In the example output below, the first line of the
+                        output shows that the task is currently unassigned.</para>
+                      <screen>
+<userinput>get /hbase/splitlog/hdfs%3A%2F%2Fhost2.sample.com%3A56020%2Fhbase%2F.logs%2Fhost6.sample.com%2C57020%2C1340474893287-splitting%2Fhost6.sample.com%253A57020.1340474893945</userinput>
+<computeroutput>unassigned host2.sample.com:57000
+cZxid = 0x7115
+ctime = Sat Jun 23 11:13:40 PDT 2012</computeroutput>
+                      </screen>
+                      <para>Based on the state of the task whose data is changed, the
split log
+                        manager does one of the following:</para>
+                      <itemizedlist>
+                        <listitem>
+                          <para>Resubmit the task if it is unassigned</para>
+                        </listitem>
+                        <listitem>
+                          <para>Heartbeat the task if it is assigned</para>
+                        </listitem>
+                        <listitem>
+                          <para>Resubmit or fail the task if it is resigned (see <xref
+                            linkend="distributed.log.replay.failure.reasons" />)</para>
+                        </listitem>
+                        <listitem>
+                          <para>Resubmit or fail the task if it is completed with errors
(see <xref
+                            linkend="distributed.log.replay.failure.reasons" />)</para>
+                        </listitem>
+                        <listitem>
+                          <para>Resubmit or fail the task if it could not complete
due to
+                            errors (see <xref
+                            linkend="distributed.log.replay.failure.reasons" />)</para>
+                        </listitem>
+                        <listitem>
+                          <para>Delete the task if it is successfully completed or failed</para>
+                        </listitem>
+                      </itemizedlist>
+                      <itemizedlist xml:id="distributed.log.replay.failure.reasons">
+                        <title>Reasons a Task Will Fail</title>
+                        <listitem><para>The task has been deleted.</para></listitem>
+                        <listitem><para>The node no longer exists.</para></listitem>
+                        <listitem><para>The log status manager failed to move
the state of the task
+                          to TASK_UNASSIGNED.</para></listitem>
+                        <listitem><para>The number of resubmits is over the resubmit
+                          threshold.</para></listitem>
+                      </itemizedlist>
+                    </listitem>
+                  </itemizedlist>
+                </step>
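The manager's reactions to task-state changes listed above amount to a small dispatch table. A Python sketch (state names come from the text; the function and its wiring are invented for illustration):

```python
def manager_action(state: str) -> str:
    """Map a split task's current state to the split log manager's
    reaction, following the list of actions described above."""
    actions = {
        "TASK_UNASSIGNED": "resubmit",
        "TASK_OWNED": "heartbeat",
        "TASK_RESIGNED": "resubmit or fail",
        "TASK_ERR": "resubmit or fail",
        "TASK_DONE": "delete",
    }
    return actions.get(state, "ignore")

print(manager_action("TASK_OWNED"))       # heartbeat
print(manager_action("TASK_UNASSIGNED"))  # resubmit
```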
+                <step>
+                  <title>Each RegionServer's split log worker performs the log-splitting
+                    tasks.</title>
+                  <para>Each RegionServer runs a daemon thread called the <firstterm>split log
+                      worker</firstterm>, which does the work to split the logs. The daemon thread
+                    starts when the RegionServer starts, and registers itself to watch HBase znodes.
+                    If any splitlog znode children change, it notifies a sleeping worker thread to
+                    wake up and grab more tasks. If a worker’s current task’s node data is changed,
+                    the worker checks to see if the task has been taken by another worker. If so,
+                    the worker thread stops work on the current task.</para>
+                  <para>The worker monitors the splitlog znode constantly. When a new task appears,
+                    the split log worker retrieves the task paths and checks each one until it finds
+                    an unclaimed task, which it attempts to claim. If the claim was successful, it
+                    attempts to perform the task and updates the task's <property>state</property>
+                    property based on the splitting outcome. At this point, the split log worker
+                    scans for another unclaimed task.</para>
+                  <itemizedlist>
+                    <title>How the Split Log Worker Approaches a Task</title>
+                    <listitem>
+                      <para>It queries the task state and only takes action if the task is in
+                        <literal>TASK_UNASSIGNED</literal> state.</para>
+                    </listitem>
+                    <listitem>
+                      <para>If the task is in <literal>TASK_UNASSIGNED</literal> state, the
+                        worker attempts to set the state to <literal>TASK_OWNED</literal> by itself.
+                        If it fails to set the state, another worker will try to grab it. The split
+                        log manager will also ask all workers to rescan later if the task remains
+                        unassigned.</para>
+                    </listitem>
+                    <listitem>
+                      <para>If the worker succeeds in taking ownership of the task, it tries to get
+                        the task state again to make sure it really gets it asynchronously. In the
+                        meantime, it starts a split task executor to do the actual work:</para>
+                      <itemizedlist>
+                        <listitem>
+                          <para>Get the HBase root folder, create a temp folder under
the root, and
+                            split the log file to the temp folder.</para>
+                        </listitem>
+                        <listitem>
+                          <para>If the split was successful, the task executor sets
the task to
+                            state <literal>TASK_DONE</literal>.</para>
+                        </listitem>
+                        <listitem>
+                          <para>If the worker catches an unexpected IOException, the
task is set to
+                            state <literal>TASK_ERR</literal>.</para>
+                        </listitem>
+                        <listitem>
+                          <para>If the worker is shutting down, set the task to
+                              <literal>TASK_RESIGNED</literal>.</para>
+                        </listitem>
+                        <listitem>
+                          <para>If the task is taken by another worker, just log it.</para>
+                        </listitem>
+                      </itemizedlist>
+                    </listitem>
+                  </itemizedlist>
+                </step>
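The claim step described above is a conditional state update. A toy Python sketch, using a plain dict to stand in for the ZooKeeper znode store (the function name, paths, and store are all made up for illustration):

```python
def try_claim(znodes: dict, path: str, worker: str) -> bool:
    """Claim a split task only if it is still unassigned, standing in for
    the worker's conditional state update on the task's znode."""
    if znodes.get(path) != "TASK_UNASSIGNED":
        return False  # owned, done, or missing: leave it alone
    znodes[path] = f"TASK_OWNED {worker}"
    return True

tasks = {"/hbase/splitlog/wal-1": "TASK_UNASSIGNED"}
print(try_claim(tasks, "/hbase/splitlog/wal-1", "rs1"))  # True
print(try_claim(tasks, "/hbase/splitlog/wal-1", "rs2"))  # False: already owned
```

In the real system the conditional update is enforced by ZooKeeper's versioned writes, so two workers racing for the same task cannot both succeed.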
+                <step>
+                  <title>The split log manager monitors for uncompleted tasks.</title>
+                  <para>The split log manager returns when all tasks are completed successfully. If
+                    all tasks are completed with some failures, the split log manager throws an
+                    exception so that the log splitting can be retried. Due to an asynchronous
+                    implementation, in very rare cases, the split log manager loses track of some
+                    completed tasks. For that reason, it periodically checks for remaining
+                    uncompleted tasks in its task map or ZooKeeper. If none are found, it throws an
+                    exception so that the log splitting can be retried right away instead of hanging
+                    there waiting for something that won’t happen.</para>
+                </step>
+              </procedure>
+            </section>
+            <section xml:id="distributed.log.replay">
+              <title>Distributed Log Replay</title>
+              <para>After a RegionServer fails, its failed regions are assigned to another
+                RegionServer, and are marked as "recovering" in ZooKeeper. A split log worker
+                replays edits from the WAL of the failed RegionServer to the regions at their new
+                location. When a region is in "recovering" state, it can accept writes but no reads
+                (including Append and Increment), region splits or merges.</para>
+              <para>Distributed Log Replay extends the <xref
+                  linkend="distributed.log.splitting" /> framework. It works by
+                directly replaying WAL edits to another RegionServer instead of creating
+                  <filename>recovered.edits</filename> files. It provides the following advantages
+                over distributed log splitting alone:</para>
+              <itemizedlist>
+                <listitem><para>It eliminates the overhead of writing and reading
a large number of
+                  <filename>recovered.edits</filename> files. It is not unusual
for thousands of
+                  <filename>recovered.edits</filename> files to be created and
written concurrently
+                  during a RegionServer recovery. Many small random writes can degrade overall
+                  system performance.</para></listitem>
+                <listitem><para>It allows writes even when a region is in recovering state. It only
+                  takes seconds for a recovering region to accept writes again.</para></listitem>
+              </itemizedlist>
+              <formalpara>
+                <title>Enabling Distributed Log Replay</title>
+                <para>To enable distributed log replay, set
+                  <varname>hbase.master.distributed.log.replay</varname> to
+                  <literal>true</literal>. This will be the default for HBase 0.99 (<link
+                    xlink:href="https://issues.apache.org/jira/browse/HBASE-10888">HBASE-10888</link>).</para>
+              </formalpara>
+              <para>You must also enable HFile version 3, which is the default HFile format starting
+                in HBase 0.99 (see <link
+                  xlink:href="https://issues.apache.org/jira/browse/HBASE-10855">HBASE-10855</link>).
+                Distributed log replay is unsafe for rolling upgrades.</para>
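As an hbase-site.xml sketch (the property name is taken from the paragraph above; the surrounding file layout is the usual HBase configuration convention, not part of this commit):

```xml
<!-- hbase-site.xml (sketch): opt in to distributed log replay
     (default from HBase 0.99 per HBASE-10888); requires HFile
     version 3 (HBASE-10855). -->
<property>
  <name>hbase.master.distributed.log.replay</name>
  <value>true</value>
</property>
```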
+            </section>
