accumulo-notifications mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Keith Turner (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (ACCUMULO-3234) Merge fails to complete
Date Wed, 15 Oct 2014 16:46:34 GMT

    [ https://issues.apache.org/jira/browse/ACCUMULO-3234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14172577#comment-14172577
] 

Keith Turner commented on ACCUMULO-3234:
----------------------------------------

These steps to reproduce, do you think this what happened in the RW test?

> Merge fails to complete
> -----------------------
>
>                 Key: ACCUMULO-3234
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-3234
>             Project: Accumulo
>          Issue Type: Bug
>          Components: master, tserver
>    Affects Versions: 1.6.1
>            Reporter: Josh Elser
>            Assignee: Josh Elser
>            Priority: Blocker
>             Fix For: 1.6.2, 1.7.0
>
>
> Running the Shard module of Randomwalk.
> Noticed that some driver code around the test had timed out:
> {noformat}
> 14 02:44:21,839 [shard.Merge] DEBUG: merging ST_index_hostname_24487_1413254628126
> 14 02:46:22,129 [impl.ThriftTransportPool] WARN : Thread "org.apache.accumulo.test.randomwalk.Framework"
stuck on IO  to master:9999 (0) for at least 120050 ms
> 14 02:49:21,807 [randomwalk.Module] WARN : Node org.apache.accumulo.test.randomwalk.shard.Merge
has been running for 300.003 seconds. You may want to look into it.
> {noformat}
> A few seconds later, the master complains about the tabletserver failing to unload some
tablets for the table we're merging (presumably to do the merge), and the TabletServer hits
an NPE trying to run a MinC (line numbers screwed up, so omitting the stack traces). We start
down trying to get a new filename
> {code}
> FileRef newMapfileLocation = getNextMapFilename(mergeFile == null ? "F" : "M");
> {code}
> Which ultimately threw an AccessControlException from Hadoop because it tried to write
to the root of HDFS instead of the provided directory (issue #1).
> This also cause an NPE in the finally block for the MinorCompactionTask on (issue #2)
> {code}
> minorCompaction.data("numEntries", Long.toString(this.stats.getNumEntries()));
> {code}
> Since all of this (+12hrs), the client is still blocked on the merge to complete, the
tserver is stuck in a loop trying to unload the tablets
> {noformat}
> 2014-10-14 22:53:48,761 [tserver.Tablet] DEBUG: initiateClose(saveState=true queueMinC=false
disableWrites=false) 2w;00001a;000018
> 2014-10-14 22:53:48,761 [tserver.TabletServer] DEBUG: Failed to unload tablet 2w;00001a;000018...
it was alread closing or closed : Tablet 2w;00001a;000018 already closing
> {noformat}
> the master is stuck in a loop waiting for those tablets to go offline (I think)
> {noformat}
> 2014-10-14 22:54:40,680 [state.MergeStats] INFO : Computing next merge state for 2w<<
which is presently WAITING_FOR_OFFLINE isDelete : false
> 2014-10-14 22:54:40,680 [state.MergeStats] INFO : 21 tablets are chopped, 0 are offline
2w<<
> 2014-10-14 22:54:40,680 [state.MergeStats] INFO : Waiting for 0 unassigned tablets to
be 21 2w<<
> 2014-10-14 22:54:40,680 [master.Master] DEBUG: [Normal Tablets] sleeping for 60.00 seconds
> 2014-10-14 22:54:40,684 [master.Master] DEBUG: Finished gathering information from 1
servers in 0.00 seconds
> 2014-10-14 22:54:40,685 [balancer.DefaultLoadBalancer] DEBUG: balance ended with 0 migrations
> 2014-10-14 22:54:40,685 [balancer.DefaultLoadBalancer] DEBUG: balance ended with 0 migrations
> 2014-10-14 22:54:40,685 [balancer.DefaultLoadBalancer] DEBUG: balance ended with 0 migrations
> 2014-10-14 22:54:40,686 [balancer.DefaultLoadBalancer] DEBUG: balance ended with 0 migrations
> 2014-10-14 22:54:40,686 [balancer.DefaultLoadBalancer] DEBUG: balance ended with 0 migrations
> 2014-10-14 22:54:40,686 [balancer.DefaultLoadBalancer] DEBUG: balance ended with 0 migrations
> 2014-10-14 22:54:40,804 [state.ZooTabletStateStore] DEBUG: Returning root tablet state:
+r<<@(null,hostname:9997[1490be48ae20006],hostname:9997[1490be48ae20006])
> 2014-10-14 22:54:40,804 [master.Master] DEBUG: [Root Table]: scan time 0.00 seconds
> 2014-10-14 22:54:40,804 [master.Master] DEBUG: [Root Table] sleeping for 60.00 seconds
> 2014-10-14 22:54:40,830 [master.Master] DEBUG: [Metadata Tablets]: scan time 0.04 seconds
> 2014-10-14 22:54:40,830 [master.Master] DEBUG: [Metadata Tablets] sleeping for 60.00
seconds
> 2014-10-14 22:54:40,830 [master.Master] DEBUG: mergeInfo overlaps: 2w;000002< true
> 2014-10-14 22:54:40,831 [master.Master] DEBUG: mergeInfo overlaps: 2w;000004;000002 true
> 2014-10-14 22:54:40,831 [master.Master] DEBUG: mergeInfo overlaps: 2w;000006;000004 true
> 2014-10-14 22:54:40,832 [master.Master] DEBUG: mergeInfo overlaps: 2w;000008;000006 true
> 2014-10-14 22:54:40,832 [master.Master] DEBUG: mergeInfo overlaps: 2w;00000a;000008 true
> 2014-10-14 22:54:40,833 [master.Master] DEBUG: mergeInfo overlaps: 2w;00000c;00000a true
> 2014-10-14 22:54:40,833 [master.Master] DEBUG: mergeInfo overlaps: 2w;00000e;00000c true
> 2014-10-14 22:54:40,834 [master.Master] DEBUG: mergeInfo overlaps: 2w;000010;00000e true
> 2014-10-14 22:54:40,834 [master.Master] DEBUG: mergeInfo overlaps: 2w;000012;000010 true
> 2014-10-14 22:54:40,835 [master.Master] DEBUG: mergeInfo overlaps: 2w;000014;000012 true
> 2014-10-14 22:54:40,835 [master.Master] DEBUG: mergeInfo overlaps: 2w;000016;000014 true
> 2014-10-14 22:54:40,835 [master.Master] DEBUG: mergeInfo overlaps: 2w;000018;000016 true
> 2014-10-14 22:54:40,836 [master.Master] DEBUG: mergeInfo overlaps: 2w;00001a;000018 true
> 2014-10-14 22:54:40,836 [master.Master] DEBUG: mergeInfo overlaps: 2w;00001c;00001a true
> 2014-10-14 22:54:40,836 [master.Master] DEBUG: mergeInfo overlaps: 2w;00001e;00001c true
> 2014-10-14 22:54:40,837 [master.Master] DEBUG: mergeInfo overlaps: 2w;000020;00001e true
> 2014-10-14 22:54:40,837 [master.Master] DEBUG: mergeInfo overlaps: 2w;000022;000020 true
> 2014-10-14 22:54:40,837 [master.Master] DEBUG: mergeInfo overlaps: 2w;000024;000022 true
> 2014-10-14 22:54:40,838 [master.Master] DEBUG: mergeInfo overlaps: 2w;000026;000024 true
> 2014-10-14 22:54:40,838 [master.Master] DEBUG: mergeInfo overlaps: 2w;000028;000026 true
> 2014-10-14 22:54:40,839 [master.Master] DEBUG: mergeInfo overlaps: 2w<;000028 true
> 2014-10-14 22:54:40,839 [master.Master] DEBUG: [Normal Tablets]: scan time 0.04 seconds
> 2014-10-14 22:54:40,839 [master.EventCoordinator] INFO : [Normal Tablets]: 21 tablets
unloaded
> {noformat}
> And the fate op is still locked (combined issue #3)
> {noformat}
> txid: 3a638f9642050e1a  status: IN_PROGRESS         op: TableRangeOp     locked: [W:2w,
R:+default] locking: []              top: TableRangeOpWait
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message