accumulo-notifications mailing list archives

From "Eric Newton (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (ACCUMULO-2269) Multiple hung fate operations during randomwalk with agitation
Date Tue, 28 Jan 2014 19:44:40 GMT

    [ https://issues.apache.org/jira/browse/ACCUMULO-2269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13884498#comment-13884498 ]

Eric Newton commented on ACCUMULO-2269:
---------------------------------------

When a bulk load is run, the updates sent to the tablets check that the bulk load is still
in progress; if it is not, the files may already have been moved away.  Something was jamming up
the tablet servers, and they were just processing old requests.
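
For readers following along, the guard described above can be pictured roughly as follows: before applying an update that references bulk-loaded files, the tablet server checks whether the originating bulk-import transaction is still alive and drops the request if it is not. The sketch below is illustrative only; the class name and znode layout are hypothetical, not Accumulo's actual internals, and it assumes a plain ZooKeeper client is available.

{noformat}
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;

/**
 * Illustrative only: a guard that checks whether the bulk-import transaction
 * that delivered a set of files is still active before applying an update for
 * them. The znode layout is hypothetical, not Accumulo's actual schema.
 */
public class BulkLoadGuard {
  private final ZooKeeper zk;
  private final String arbitratorPath; // hypothetical, e.g. "/accumulo/<instance-id>/bulk_arbitrator"

  public BulkLoadGuard(ZooKeeper zk, String arbitratorPath) {
    this.zk = zk;
    this.arbitratorPath = arbitratorPath;
  }

  /** A bulk-import txid is considered alive while its znode still exists. */
  public boolean transactionStillActive(long tid)
      throws KeeperException, InterruptedException {
    return zk.exists(arbitratorPath + "/" + Long.toString(tid, 16), false) != null;
  }

  /** Apply the update only while the originating bulk load is in progress. */
  public void maybeApply(long tid, Runnable applyUpdate) throws Exception {
    if (transactionStillActive(tid)) {
      applyUpdate.run();
    } else {
      // The fate op already finished or was cleaned up; the files may have been
      // moved away, so a stale request drained from the queue is simply dropped.
    }
  }
}
{noformat}

The point of such a check is exactly the failure mode in this ticket: once the tablet servers un-jam and start draining old requests, any bulk-load update whose fate operation has already completed must be skipped.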


> Multiple hung fate operations during randomwalk with agitation
> --------------------------------------------------------------
>
>                 Key: ACCUMULO-2269
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-2269
>             Project: Accumulo
>          Issue Type: Bug
>          Components: fate, master
>         Environment: 1.5.1-SNAPSHOT: 8981ba04
>            Reporter: Josh Elser
>            Priority: Critical
>             Fix For: 1.5.1
>
>
> Was running LongClean randomwalk with agitation. Came back to the system with three tables "stuck" in DELETING on the monitor and a generally idle system. Upon investigation, multiple fate txns appear to be deadlocked, in addition to the DeleteTable ops.
> {noformat}
> txid: 7ca950aa8de76a17  status: IN_PROGRESS         op: DeleteTable      locked: [W:2dc]         locking: []              top: CleanUp
> txid: 1071086efdbed442  status: IN_PROGRESS         op: BulkImport       locked: [R:2cr]         locking: []              top: LoadFiles
> txid: 32b86cfe06c2ed5d  status: IN_PROGRESS         op: DeleteTable      locked: [W:2d9]         locking: []              top: CleanUp
> txid: 358c065b6cb0516b  status: IN_PROGRESS         op: DeleteTable      locked: [W:2dw]         locking: []              top: CleanUp
> txid: 26b738ee0b044a96  status: IN_PROGRESS         op: BulkImport       locked: [R:2cr]         locking: []              top: CopyFailed
> txid: 16edd31b3723dc5b  status: IN_PROGRESS         op: BulkImport       locked: [R:2cr]         locking: []              top: CopyFailed
> txid: 63c587eb3df6c1b2  status: IN_PROGRESS         op: CompactRange     locked: [R:2cr]         locking: []              top: CompactionDriver
> txid: 722d8e5488531735  status: IN_PROGRESS         op: BulkImport       locked: [R:2cr]         locking: []              top: CopyFailed
> {noformat}
> I started digging into the DeleteTable ops. Each txn still appears to be active and holds the table_lock for its respective table in ZK, but the /tables/id/ node and all of its children (state, conf, name, etc.) still exist.
> Looking at some thread dumps, I have the default (4) repo runner threads. Three of them are blocked on bulk imports:
> {noformat}
> "Repo runner 2" daemon prio=10 tid=0x000000000262b800 nid=0x1ae7 waiting on condition
[0x00007f25168e7000]
>    java.lang.Thread.State: WAITING (parking)
>         at sun.misc.Unsafe.park(Native Method)
>         - parking to wait for  <0x0000000705a05eb8> (a java.util.concurrent.FutureTask)
>         at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
>         at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:425)
>         at java.util.concurrent.FutureTask.get(FutureTask.java:187)
>         at org.apache.accumulo.server.master.tableOps.LoadFiles.call(BulkImport.java:561)
>         at org.apache.accumulo.server.master.tableOps.LoadFiles.call(BulkImport.java:449)
>         at org.apache.accumulo.server.master.tableOps.TraceRepo.call(TraceRepo.java:65)
>         at org.apache.accumulo.fate.Fate$TransactionRunner.run(Fate.java:64)
>         at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:34)
>         at java.lang.Thread.run(Thread.java:744)
> {noformat}
> The 4th repo runner is stuck trying to reserve a new txn (not sure why it's blocked like this, though):
> {noformat}
> "Repo runner 1" daemon prio=10 tid=0x0000000002627800 nid=0x1ae6 in Object.wait() [0x00007f25169e8000]
>    java.lang.Thread.State: WAITING (on object monitor)
>         at java.lang.Object.wait(Native Method)
>         at java.lang.Object.wait(Object.java:503)
>         at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1313)
>         - locked <0x00000007014d9928> (a org.apache.zookeeper.ClientCnxn$Packet)
>         at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1149)
>         at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1180)
>         at org.apache.accumulo.fate.zookeeper.ZooReader.getData(ZooReader.java:44)
>         at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:606)
>         at org.apache.accumulo.server.zookeeper.ZooReaderWriter$1.invoke(ZooReaderWriter.java:67)
>         at com.sun.proxy.$Proxy11.getData(Unknown Source)
>         at org.apache.accumulo.fate.ZooStore.reserve(ZooStore.java:160)
>         at org.apache.accumulo.fate.AgeOffStore.reserve(AgeOffStore.java:156)
>         at org.apache.accumulo.fate.Fate$TransactionRunner.run(Fate.java:52)
>         at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:34)
>         at java.lang.Thread.run(Thread.java:744)
> {noformat}
> There were no obvious errors on the monitor, and the master is still in this state.
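
For anyone wanting to repeat the kind of inspection described in the report (which fate txns exist and which table locks they still hold), the relevant znodes can be read directly with the plain ZooKeeper client. The sketch below is a minimal, read-only walker; the paths under /accumulo/<instance-id> are assumptions based on the report's wording, so adjust them to the actual instance layout.

{noformat}
import java.util.List;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

/**
 * Minimal read-only walker over (assumed) fate and table-lock znodes.
 * The paths are illustrative; adjust them to the real instance layout.
 */
public class FateZkDump {
  public static void main(String[] args) throws Exception {
    // args[0] = ZooKeeper connect string, args[1] = instance root, e.g. "/accumulo/<instance-id>"
    ZooKeeper zk = new ZooKeeper(args[0], 30000, new Watcher() {
      public void process(WatchedEvent event) {}
    });
    String root = args[1];

    // Fate transactions: one child znode per txid, with a status payload.
    for (String tx : zk.getChildren(root + "/fate", false)) {
      byte[] data = zk.getData(root + "/fate/" + tx, false, new Stat());
      System.out.println(tx + " -> " + (data == null ? "" : new String(data)));
    }

    // Table locks: one child per table id, whose children are the queued lock holders.
    for (String table : zk.getChildren(root + "/table_locks", false)) {
      List<String> holders = zk.getChildren(root + "/table_locks/" + table, false);
      System.out.println("table " + table + " lock holders: " + holders);
    }

    zk.close();
  }
}
{noformat}

Comparing that output with the monitor would show whether the DeleteTable txns still hold their write locks while the corresponding /tables/id/ nodes remain.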



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
