hbase-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jeffrey Zhong (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HBASE-8276) Backport hbase-6738 to 0.94 "Too aggressive task resubmission from the distributed log manager"
Date Fri, 05 Apr 2013 01:15:14 GMT

     [ https://issues.apache.org/jira/browse/HBASE-8276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Jeffrey Zhong updated HBASE-8276:
---------------------------------

    Attachment: hbase-8276.patch

[~nkeywal] Could you please review this if it's good on 0.94 branch?

Thanks,
-Jeffrey
                
> Backport hbase-6738 to 0.94 "Too aggressive task resubmission from the distributed log
manager"
> -----------------------------------------------------------------------------------------------
>
>                 Key: HBASE-8276
>                 URL: https://issues.apache.org/jira/browse/HBASE-8276
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Jeffrey Zhong
>            Assignee: Jeffrey Zhong
>         Attachments: hbase-8276.patch
>
>
> In recent tests, we found situations that when some data nodes are down and file operations
are slow depending on underlying hdfs timeout(normally 30 secs and socket connection timeout
maybe around 1 min). While split log task heart beat time out is only 25 secs, a split log
task will be preempted by SplitLogManager and assign to someone else after the 25 secs. On
a small cluster, you'll see the same task is keeping bounced back & force for a while.
I pasted a snippet of related logs below. You can search "preempted from" to see a task is
preempted.
> {code}
> 2013-04-01 21:22:08,599 INFO org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Splitting
hlog: hdfs://ip-10-137-16-140.ec2.internal:8020/apps/hbase/data/.logs/ip-10-137-20-188.ec2.internal,60020,1364849530779-splitting/ip-10-137-20-188.ec2.internal%2C60020%2C1364849530779.1364865506159,
length=127639653
> 2013-04-01 21:22:08,599 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: Recovering file
hdfs://ip-10-137-16-140.ec2.internal:8020/apps/hbase/data/.logs/ip-10-137-20-188.ec2.internal,60020,1364849530779-splitting/ip-10-137-20-188.ec2.internal%2C60020%2C1364849530779.1364865506159
> 2013-04-01 21:22:09,603 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: Finished lease
recover attempt for hdfs://ip-10-137-16-140.ec2.internal:8020/apps/hbase/data/.logs/ip-10-137-20-188.ec2.internal,60020,1364849530779-splitting/ip-10-137-20-188.ec2.internal%2C60020%2C1364849530779.1364865506159
> 2013-04-01 21:22:09,629 WARN org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Found
existing old edits file. It could be the result of a previous failed split attempt. Deleting
hdfs://ip-10-137-16-140.ec2.internal:8020/apps/hbase/data/IntegrationTestLoadAndVerify/73387f8d327a45bacf069bd631d70b3b/recovered.edits/0000000000003703447.temp,
length=0
> 2013-04-01 21:22:09,629 WARN org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Found
existing old edits file. It could be the result of a previous failed split attempt. Deleting
hdfs://ip-10-137-16-140.ec2.internal:8020/apps/hbase/data/IntegrationTestLoadAndVerify/b749cbceaaf037c97f70cc2a6f48f2b8/recovered.edits/0000000000003703446.temp,
length=0
> 2013-04-01 21:22:09,630 WARN org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Found
existing old edits file. It could be the result of a previous failed split attempt. Deleting
hdfs://ip-10-137-16-140.ec2.internal:8020/apps/hbase/data/IntegrationTestLoadAndVerify/c26b9d4a042d42c1194a8c2f389d33c8/recovered.edits/0000000000003703448.temp,
length=0
> 2013-04-01 21:22:09,666 WARN org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Found
existing old edits file. It could be the result of a previous failed split attempt. Deleting
hdfs://ip-10-137-16-140.ec2.internal:8020/apps/hbase/data/IntegrationTestLoadAndVerify/adabdb40ccd52140f09f953ff41fd829/recovered.edits/0000000000003703451.temp,
length=0
> 2013-04-01 21:22:09,722 WARN org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Found
existing old edits file. It could be the result of a previous failed split attempt. Deleting
hdfs://ip-10-137-16-140.ec2.internal:8020/apps/hbase/data/IntegrationTestLoadAndVerify/19f463fe74f4365e7df3e5fdb13aecad/recovered.edits/0000000000003703468.temp,
length=0
> 2013-04-01 21:22:09,734 WARN org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Found
existing old edits file. It could be the result of a previous failed split attempt. Deleting
hdfs://ip-10-137-16-140.ec2.internal:8020/apps/hbase/data/IntegrationTestLoadAndVerify/b3e759a3fc9c4e83064961cc3cd4a911/recovered.edits/0000000000003703459.temp,
length=0
> 2013-04-01 21:22:09,770 WARN org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Found
existing old edits file. It could be the result of a previous failed split attempt. Deleting
hdfs://ip-10-137-16-140.ec2.internal:8020/apps/hbase/data/IntegrationTestLoadAndVerify/6f078553be50897a986734ae043a5889/recovered.edits/0000000000003703454.temp,
length=0
> 2013-04-01 21:22:34,985 INFO org.apache.hadoop.hbase.regionserver.SplitLogWorker: task
/hbase/splitlog/hdfs%3A%2F%2Fip-10-137-16-140.ec2.internal%3A8020%2Fapps%2Fhbase%2Fdata%2F.logs%2Fip-10-137-20-188.ec2.internal%2C60020%2C1364849530779-splitting%2Fip-10-137-20-188.ec2.internal%252C60020%252C1364849530779.1364865506159
preempted from ip-10-151-29-196.ec2.internal,60020,1364849530671, current task state and owner=unassigned
ip-10-137-16-140.ec2.internal,60000,1364849528428
> {code}
> The exact same issue is fixed by hbase-6738 in trunk so here comes the backport. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Mime
View raw message