hbase-issues mailing list archives

From "jiraposter@reviews.apache.org (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-4222) Make HLog more resilient to write pipeline failures
Date Sat, 20 Aug 2011 00:37:28 GMT

    [ https://issues.apache.org/jira/browse/HBASE-4222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13088091#comment-13088091 ]

jiraposter@reviews.apache.org commented on HBASE-4222:
------------------------------------------------------



bq.  On 2011-08-19 18:58:21, Michael Stack wrote:
bq.  >

I'll post an update with a default setting of 2 in hbase-default.xml and some fixes to TestLogRolling
-- my additional test is not playing nicely with the HBASE-4095 changes there at the moment.
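
For reference, here is a rough sketch of how that setting would be picked up from the
configuration. Only the property name and the default of 2 come from this patch; the
surrounding class is illustrative:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    // Illustrative only: read the number of consecutive log roll close errors
    // a region server will tolerate before aborting (default 2 per this patch).
    public class LogRollConfigExample {
      public static void main(String[] args) {
        Configuration conf = HBaseConfiguration.create();
        int errorsTolerated =
            conf.getInt("hbase.regionserver.logroll.errors.tolerated", 2);
        System.out.println("Tolerating " + errorsTolerated
            + " consecutive log roll close errors before aborting");
      }
    }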


bq.  On 2011-08-19 18:58:21, Michael Stack wrote:
bq.  > src/main/java/org/apache/hadoop/hbase/regionserver/LogRoller.java, line 87
bq.  > <https://reviews.apache.org/r/1590/diff/1/?file=33750#file33750line87>
bq.  >
bq.  >     How do you manually roll a log?  I want that.

Probably wouldn't be too hard to add an RPC call and shell command to manually trigger a roll.
That would be nice to have, but I'll leave it for a separate issue.

(The log message just means the roll was triggered by HLog.requestLogRoll(), i.e. from an
IOException, the current log size, or the replica count dropping below the threshold.)
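
To make those trigger paths concrete, a small self-contained sketch follows; the listener
interface and threshold names are illustrative, not the actual HLog code:

    import java.io.IOException;

    // Sketch of the three paths that can request a log roll: an IOException
    // from append()/sync(), the log growing past its size limit, or the block
    // replica count dropping below the configured minimum.
    public class RollRequestSketch {

      interface RollListener {
        void requestLogRoll();   // in HBase this is HLog.requestLogRoll()
      }

      private final RollListener listener;
      private final long maxLogSize;
      private final int minReplicas;

      RollRequestSketch(RollListener listener, long maxLogSize, int minReplicas) {
        this.listener = listener;
        this.maxLogSize = maxLogSize;
        this.minReplicas = minReplicas;
      }

      void onWriteError(IOException e) {
        listener.requestLogRoll();          // path 1: write/sync failure
      }

      void checkLogSize(long currentSize) {
        if (currentSize > maxLogSize) {
          listener.requestLogRoll();        // path 2: log file too large
        }
      }

      void checkReplication(int currentReplicas) {
        if (currentReplicas < minReplicas) {
          listener.requestLogRoll();        // path 3: pipeline under-replicated
        }
      }
    }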


- Gary


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/1590/#review1557
-----------------------------------------------------------


On 2011-08-19 18:33:11, Gary Helmling wrote:
bq.  
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/1590/
bq.  -----------------------------------------------------------
bq.  
bq.  (Updated 2011-08-19 18:33:11)
bq.  
bq.  
bq.  Review request for hbase.
bq.  
bq.  
bq.  Summary
bq.  -------
bq.  
bq.  This patch corrects a few problems, as I see it, with the current log rolling process:
bq.  
bq.  1) HLog.LogSyncer.run() now handles an IOException in the inner while loop.  Previously
bq.  any IOException would cause the LogSyncer thread to exit, even if the subsequent log roll
bq.  succeeded.  This would mean the region server kept running without a LogSyncer thread
bq.  2) Log rolls triggered by IOExceptions were being skipped in the event that there were
bq.  no entries in the log.  This would prevent the log from being recovered in a timely manner.
bq.  3) minor - FailedLogCloseException was never actually being thrown out of HLog.cleanupCurrentWriter(),
bq.  resulting in inaccurate logging on RS abort
bq.  
bq.  The bigger change is the addition of a configuration property -- hbase.regionserver.logroll.errors.tolerated
bq.  -- that is checked against a counter of consecutive close errors to see whether or not an
bq.  abort should be triggered.
bq.  
bq.  Prior to this patch, we could readily trigger region server aborts by rolling all the
bq.  data nodes in a cluster while region servers were running.  This was equally true whether
bq.  write activity was happening or not.  (In fact I think having concurrent write activity actually
bq.  gave a better chance for the log to be rolled prior to all DNs in the write pipeline going
bq.  down and thus the region server not aborting).
bq.  
bq.  With this change and hbase.regionserver.logroll.errors.tolerated=2, I can roll DNs at
bq.  will without causing any loss of service.
bq.  
bq.  I'd appreciate some scrutiny on any log rolling subtleties or interactions I may be missing
bq.  here.  If there are alternate/better ways to handle this in the DFSClient layer, I'd also
bq.  appreciate any pointers to that.
bq.  
bq.  
bq.  This addresses bug HBASE-4222.
bq.      https://issues.apache.org/jira/browse/HBASE-4222
bq.  
bq.  
bq.  Diffs
bq.  -----
bq.  
bq.    src/main/java/org/apache/hadoop/hbase/regionserver/LogRoller.java 8e87c83 
bq.    src/main/java/org/apache/hadoop/hbase/regionserver/wal/HLog.java 887f736 
bq.    src/test/java/org/apache/hadoop/hbase/regionserver/wal/TestLogRolling.java 287f1fb
bq.  
bq.  Diff: https://reviews.apache.org/r/1590/diff
bq.  
bq.  
bq.  Testing
bq.  -------
bq.  
bq.  Added a new test for rolling data nodes under a running cluster: TestLogRolling.testLogRollOnPipelineRestart().
bq.  
bq.  Tested patch on a running cluster with 3 slaves, rolling data nodes with and without
bq.  concurrent write activity.
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  Gary
bq.  
bq.
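
As point 1 of the summary quoted above describes, the key change in LogSyncer is that an
IOException during a periodic sync now requests a roll instead of killing the thread. A
minimal, self-contained sketch of that pattern (the names are illustrative, not the actual
HLog.LogSyncer code):

    import java.io.IOException;

    // Sketch of the fixed syncer loop: an IOException triggers a roll request
    // but no longer terminates the thread.
    public class SyncerLoopSketch implements Runnable {

      interface Wal {
        void sync() throws IOException;
        void requestLogRoll();
      }

      private final Wal wal;
      private final long syncIntervalMillis;
      private volatile boolean closing = false;

      SyncerLoopSketch(Wal wal, long syncIntervalMillis) {
        this.wal = wal;
        this.syncIntervalMillis = syncIntervalMillis;
      }

      @Override
      public void run() {
        while (!closing) {
          try {
            Thread.sleep(syncIntervalMillis);
            wal.sync();
          } catch (IOException e) {
            // Previously this escaped the loop and the thread died; now we
            // ask for a log roll and keep syncing.
            wal.requestLogRoll();
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            closing = true;
          }
        }
      }

      void shutdown() {
        closing = true;
      }
    }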



> Make HLog more resilient to write pipeline failures
> ---------------------------------------------------
>
>                 Key: HBASE-4222
>                 URL: https://issues.apache.org/jira/browse/HBASE-4222
>             Project: HBase
>          Issue Type: Improvement
>          Components: wal
>            Reporter: Gary Helmling
>            Assignee: Gary Helmling
>             Fix For: 0.92.0
>
>
> The current implementation of HLog rolling to recover from transient errors in the write
> pipeline seems to have two problems:
> # When {{HLog.LogSyncer}} triggers an {{IOException}} during time-based sync operations,
> it triggers a log rolling request in the corresponding catch block, but only after escaping
> from the internal while loop.  As a result, the {{LogSyncer}} thread will exit and never be
> restarted from what I can tell, even if the log rolling was successful.
> # Log rolling requests triggered by an {{IOException}} in {{sync()}} or {{append()}}
> never happen if no entries have yet been written to the log.  This means that write errors
> are not immediately recovered, which extends the exposure to more errors occurring in the
> pipeline.
> In addition, it seems like we should be able to better handle transient problems, like
> a rolling restart of DataNodes while the HBase RegionServers are running.  Currently this
> will reliably cause RegionServer aborts during log rolling: either an append or time-based
> sync triggers an initial {{IOException}}, initiating a log rolling request.  However the log
> rolling then fails in closing the current writer ("All datanodes are bad"), causing a RegionServer
> abort.  In this case, it seems like we should at least allow you an option to continue with
> the new writer and only abort on subsequent errors.
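
A hedged sketch of the tolerance check the description above argues for: count consecutive
close failures during log rolls and abort only once the count passes the configured limit.
The property name comes from the patch; the counter and hooks are illustrative:

    import java.io.IOException;

    // Sketch: tolerate up to hbase.regionserver.logroll.errors.tolerated
    // consecutive close failures during log rolls before aborting.
    public class CloseErrorToleranceSketch {

      private final int errorsTolerated;   // e.g. 2, per hbase-default.xml
      private int consecutiveCloseErrors = 0;

      CloseErrorToleranceSketch(int errorsTolerated) {
        this.errorsTolerated = errorsTolerated;
      }

      /** Returns true if the region server should abort after this failure. */
      boolean onCloseFailure(IOException cause) {
        consecutiveCloseErrors++;
        return consecutiveCloseErrors > errorsTolerated;
      }

      /** A clean roll resets the counter. */
      void onSuccessfulRoll() {
        consecutiveCloseErrors = 0;
      }
    }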

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
