hbase-issues mailing list archives

From "Jeffrey Zhong (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-8276) Backport hbase-6738 to 0.94 "Too aggressive task resubmission from the distributed log manager"
Date Fri, 05 Apr 2013 22:43:16 GMT

    [ https://issues.apache.org/jira/browse/HBASE-8276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13624140#comment-13624140 ]

Jeffrey Zhong commented on HBASE-8276:
--------------------------------------

[~lhofhansl] For your question:
{quote}
In 0.94 can we get away with just increasing the timeout to 5mins?
{quote}
A good question; I don't think so. The reason the timeout can be raised to 5 mins with the patch
in place is that the patch adds logic to resubmit a task immediately when a dead server is detected,
without waiting for the timeout (while, before the fix, SplitLogManager waits out the full 25-sec
timeout even if the server is already marked as dead in the master). Back to your question: if we
just bump the timeout to 5 mins without the rest of the patch, it will hurt recovery time, because
log splitting slows down in the normal dead-server scenarios.
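To make the difference concrete, here is a minimal sketch of the resubmit-on-dead-server idea;
the class and method names below are illustrative placeholders, not the actual hbase-6738/hbase-8276
patch code.
{code}
// Illustrative sketch only: SplitLogResubmitSketch, Task, onServerDead, etc. are
// hypothetical names and do not correspond to the real HBASE-6738 patch.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class SplitLogResubmitSketch {

  /** Simplified view of a split-log task tracked by the manager. */
  static class Task {
    volatile String curWorker;         // region server currently working on the task
    volatile long lastHeartbeatMillis; // last time the worker heartbeated the task
  }

  private final Map<String, Task> tasks = new ConcurrentHashMap<String, Task>();
  private final long heartbeatTimeoutMillis = 25000L;

  /**
   * Pre-fix behavior: the timeout monitor resubmits a task only after the
   * 25-sec heartbeat timeout expires, even if the worker is already dead.
   */
  void onTimeoutMonitorTick(long now) {
    for (Task t : tasks.values()) {
      if (now - t.lastHeartbeatMillis > heartbeatTimeoutMillis) {
        resubmit(t);
      }
    }
  }

  /**
   * Behavior added by the fix (sketched): when the master learns a region
   * server is dead, resubmit its tasks right away instead of waiting for the
   * heartbeat timeout. Only with this in place can the timeout itself be
   * raised (e.g. to 5 mins) without slowing down dead-server recovery.
   */
  void onServerDead(String deadServer) {
    for (Task t : tasks.values()) {
      if (deadServer.equals(t.curWorker)) {
        resubmit(t);
      }
    }
  }

  private void resubmit(Task t) {
    t.curWorker = null; // mark unassigned so another worker can grab it
  }
}
{code}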
                
> Backport hbase-6738 to 0.94 "Too aggressive task resubmission from the distributed log manager"
> -----------------------------------------------------------------------------------------------
>
>                 Key: HBASE-8276
>                 URL: https://issues.apache.org/jira/browse/HBASE-8276
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Jeffrey Zhong
>            Assignee: Jeffrey Zhong
>         Attachments: hbase-8276.patch, hbase-8276-v1.patch
>
>
> In recent tests, we found situations where some data nodes are down and file operations
become slow, gated by the underlying HDFS timeouts (normally 30 secs, with a socket connection
timeout of maybe around 1 min). Since the split log task heartbeat timeout is only 25 secs,
a split log task will be preempted by SplitLogManager and assigned to someone else after those
25 secs. On a small cluster, you'll see the same task keep bouncing back and forth for a while.
I pasted a snippet of related logs below; search for "preempted from" to see a task being
preempted. A sketch of the timeout comparison follows after the issue description.
> {code}
> 2013-04-01 21:22:08,599 INFO org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Splitting
hlog: hdfs://ip-10-137-16-140.ec2.internal:8020/apps/hbase/data/.logs/ip-10-137-20-188.ec2.internal,60020,1364849530779-splitting/ip-10-137-20-188.ec2.internal%2C60020%2C1364849530779.1364865506159,
length=127639653
> 2013-04-01 21:22:08,599 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: Recovering file
hdfs://ip-10-137-16-140.ec2.internal:8020/apps/hbase/data/.logs/ip-10-137-20-188.ec2.internal,60020,1364849530779-splitting/ip-10-137-20-188.ec2.internal%2C60020%2C1364849530779.1364865506159
> 2013-04-01 21:22:09,603 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: Finished lease
recover attempt for hdfs://ip-10-137-16-140.ec2.internal:8020/apps/hbase/data/.logs/ip-10-137-20-188.ec2.internal,60020,1364849530779-splitting/ip-10-137-20-188.ec2.internal%2C60020%2C1364849530779.1364865506159
> 2013-04-01 21:22:09,629 WARN org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Found
existing old edits file. It could be the result of a previous failed split attempt. Deleting
hdfs://ip-10-137-16-140.ec2.internal:8020/apps/hbase/data/IntegrationTestLoadAndVerify/73387f8d327a45bacf069bd631d70b3b/recovered.edits/0000000000003703447.temp,
length=0
> 2013-04-01 21:22:09,629 WARN org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Found
existing old edits file. It could be the result of a previous failed split attempt. Deleting
hdfs://ip-10-137-16-140.ec2.internal:8020/apps/hbase/data/IntegrationTestLoadAndVerify/b749cbceaaf037c97f70cc2a6f48f2b8/recovered.edits/0000000000003703446.temp,
length=0
> 2013-04-01 21:22:09,630 WARN org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Found
existing old edits file. It could be the result of a previous failed split attempt. Deleting
hdfs://ip-10-137-16-140.ec2.internal:8020/apps/hbase/data/IntegrationTestLoadAndVerify/c26b9d4a042d42c1194a8c2f389d33c8/recovered.edits/0000000000003703448.temp,
length=0
> 2013-04-01 21:22:09,666 WARN org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Found
existing old edits file. It could be the result of a previous failed split attempt. Deleting
hdfs://ip-10-137-16-140.ec2.internal:8020/apps/hbase/data/IntegrationTestLoadAndVerify/adabdb40ccd52140f09f953ff41fd829/recovered.edits/0000000000003703451.temp,
length=0
> 2013-04-01 21:22:09,722 WARN org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Found
existing old edits file. It could be the result of a previous failed split attempt. Deleting
hdfs://ip-10-137-16-140.ec2.internal:8020/apps/hbase/data/IntegrationTestLoadAndVerify/19f463fe74f4365e7df3e5fdb13aecad/recovered.edits/0000000000003703468.temp,
length=0
> 2013-04-01 21:22:09,734 WARN org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Found
existing old edits file. It could be the result of a previous failed split attempt. Deleting
hdfs://ip-10-137-16-140.ec2.internal:8020/apps/hbase/data/IntegrationTestLoadAndVerify/b3e759a3fc9c4e83064961cc3cd4a911/recovered.edits/0000000000003703459.temp,
length=0
> 2013-04-01 21:22:09,770 WARN org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Found
existing old edits file. It could be the result of a previous failed split attempt. Deleting
hdfs://ip-10-137-16-140.ec2.internal:8020/apps/hbase/data/IntegrationTestLoadAndVerify/6f078553be50897a986734ae043a5889/recovered.edits/0000000000003703454.temp,
length=0
> 2013-04-01 21:22:34,985 INFO org.apache.hadoop.hbase.regionserver.SplitLogWorker: task
/hbase/splitlog/hdfs%3A%2F%2Fip-10-137-16-140.ec2.internal%3A8020%2Fapps%2Fhbase%2Fdata%2F.logs%2Fip-10-137-20-188.ec2.internal%2C60020%2C1364849530779-splitting%2Fip-10-137-20-188.ec2.internal%252C60020%252C1364849530779.1364865506159
preempted from ip-10-151-29-196.ec2.internal,60020,1364849530671, current task state and owner=unassigned
ip-10-137-16-140.ec2.internal,60000,1364849528428
> {code}
> The exact same issue is fixed by hbase-6738 in trunk, so here comes the backport.
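
To make the timeout mismatch above concrete, here is a minimal comparison of the two settings
involved; the property keys and default values are assumptions for illustration and may differ
by HBase/HDFS version.
{code}
// Illustrative only: property keys and defaults below are assumptions, not
// verified against any particular HBase/HDFS release.
import org.apache.hadoop.conf.Configuration;

public class TimeoutMismatchCheck {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Split log task heartbeat timeout (assumed key/default for 0.94-era HBase).
    long splitLogTimeout = conf.getLong("hbase.splitlog.manager.timeout", 25000L);
    // HDFS client socket timeout (assumed key/default).
    long dfsSocketTimeout = conf.getLong("dfs.socket.timeout", 60000L);

    if (splitLogTimeout < dfsSocketTimeout) {
      System.out.println("Split log tasks can be preempted after " + splitLogTimeout
          + " ms while slow HDFS operations may take up to " + dfsSocketTimeout
          + " ms, which produces the bouncing seen in the log above.");
    }
  }
}
{code}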

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
