hbase-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jieshan Bean (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HBASE-4095) Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked
Date Fri, 05 Aug 2011 08:52:27 GMT

     [ https://issues.apache.org/jira/browse/HBASE-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Jieshan Bean updated HBASE-4095:
--------------------------------

    Attachment: HBase-4095-V4-Branch.patch
                TestResultForPatch-V4.rar

The analysis before this has told why did the large-Hlog file occur.
Be brief, it's because the variable of "HLog#logRollRequested". Once the rollWriter was requested,
it's value was set as true. But the rollWriter was able to executed in time. So the following
roll-requests were skippted. 

In the original code, it using the following code to try to get the initial Hlog replicas:
{noformat}
 int nextInitialReplication = fs.getFileStatus(newPath).getReplication(); 
{noformat}
But it's value always equals with the default replicas number.Once the actual datanode count
is less than default replicas value. rollWriter will be requested in each sync.
So rollWriter was executed one by one with a high frequency.

The patch of V4 is trying to solve both the "large-file" problem and the "high-frequency roll
request" problem.

1. Replace the variable of "HLog#logRollRequested" as "HLog#logRollRunning". A new request
will be skippted while another rollWriter was running.
2. Add a new variable named "minTolerableReplication". Once the actual replicas is less than
it, send a rollWriter request.
3. Once a sequence of unreasonable rollWriter executed, disable roll-request from checkLowReplication().
Enable it while replicas return to normal.

I've take many tests on the patch. Please check them from the attachment of "TestResultForPatch-V4.rar".

> Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is
blocked
> --------------------------------------------------------------------------------------------
>
>                 Key: HBASE-4095
>                 URL: https://issues.apache.org/jira/browse/HBASE-4095
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>            Assignee: Jieshan Bean
>         Attachments: HBASE-4095-90-v2.patch, HBASE-4095-90.patch, HBASE-4095-trunk-v2.patch,
HBASE-4095-trunk.patch, HBase-4095-V4-Branch.patch, HlogFileIsVeryLarge.gif, RelatedLogs2011-07-28.txt,
TestResultForPatch-V4.rar, hbase-root-regionserver-193-195-5-111.rar, surefire-report-branch.html
>
>
> Some large Hlog files(Larger than 10G) appeared in our environment, and I got the reason
why they got so huge:
> 1. The replicas is less than the expect number. So the method of checkLowReplication
will be called each sync.
> 2. The method checkLowReplication request log-roll first, and set logRollRequested as
true: 
> {noformat}
> private void checkLowReplication() {
> // if the number of replicas in HDFS has fallen below the initial
> // value, then roll logs.
> try {
>   int numCurrentReplicas = getLogReplication();
>   if (numCurrentReplicas != 0 &&
> 	  numCurrentReplicas < this.initialReplication) {
> 	LOG.warn("HDFS pipeline error detected. " +
> 		"Found " + numCurrentReplicas + " replicas but expecting " +
> 		this.initialReplication + " replicas. " +
> 		" Requesting close of hlog.");
> 	requestLogRoll();
> 	logRollRequested = true;
>   }
> } catch (Exception e) {
>   LOG.warn("Unable to invoke DFSOutputStream.getNumCurrentReplicas" + e +
> 	  " still proceeding ahead...");
> }
> }
> {noformat}
> 3.requestLogRoll() just commit the roll request. It may not execute in time, for it must
got the un-fair lock of cacheFlushLock.
> But the lock may be carried by the cacheflush threads.
> 4.logRollRequested was true until the log-roll executed. So during the time, each request
of log-roll in sync() was skipped.
> Here's the logs while the problem happened(Please notice the file size of hlog "193-195-5-111%3A20020.1309937386639"
in the last row):
> 2011-07-06 15:28:59,284 WARN org.apache.hadoop.hbase.regionserver.wal.HLog: HDFS pipeline
error detected. Found 2 replicas but expecting 3 replicas.  Requesting close of hlog.
> 2011-07-06 15:29:46,714 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Roll /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937339119,
entries=32434, filesize=239589754. New hlog /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937386639
> 2011-07-06 15:29:56,929 WARN org.apache.hadoop.hbase.regionserver.wal.HLog: HDFS pipeline
error detected. Found 2 replicas but expecting 3 replicas.  Requesting close of hlog.
> 2011-07-06 15:29:56,933 INFO org.apache.hadoop.hbase.regionserver.Store: Renaming flushed
file at hdfs://193.195.5.112:9000/hbase/Htable_UFDR_034/a3780cf0c909d8cf8f8ed618b290cc95/.tmp/4656903854447026847
to hdfs://193.195.5.112:9000/hbase/Htable_UFDR_034/a3780cf0c909d8cf8f8ed618b290cc95/value/8603005630220380983
> 2011-07-06 15:29:57,391 INFO org.apache.hadoop.hbase.regionserver.Store: Added hdfs://193.195.5.112:9000/hbase/Htable_UFDR_034/a3780cf0c909d8cf8f8ed618b290cc95/value/8603005630220380983,
entries=445880, sequenceid=248900, memsize=207.5m, filesize=130.1m
> 2011-07-06 15:29:57,478 INFO org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore
flush of ~207.5m for region Htable_UFDR_034,07664,1309936974158.a3780cf0c909d8cf8f8ed618b290cc95.
in 10839ms, sequenceid=248900, compaction requested=false
> 2011-07-06 15:28:59,236 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Roll /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309926531955,
entries=216459, filesize=2370387468. New hlog /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937339119
> 2011-07-06 15:29:46,714 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Roll /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937339119,
entries=32434, filesize=239589754. New hlog /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937386639
> 2011-07-06 16:29:58,775 DEBUG org.apache.hadoop.hbase.regionserver.LogRoller: Hlog roll
period 3600000ms elapsed
> 2011-07-06 16:29:58,775 DEBUG org.apache.hadoop.hbase.regionserver.LogRoller: Hlog roll
period 3600000ms elapsed
> 2011-07-06 16:30:01,978 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Roll /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937386639,
entries=1135576, filesize=19220372830. New hlog /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309940998890

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Mime
View raw message