accumulo-notifications mailing list archives

From "Keith Turner (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (ACCUMULO-2154) NoNodeException error in master
Date Tue, 21 Jan 2014 16:13:19 GMT

    [ https://issues.apache.org/jira/browse/ACCUMULO-2154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13877571#comment-13877571 ]

Keith Turner commented on ACCUMULO-2154:
----------------------------------------

I realize that synchronization will fix this if we assume that only one process, and one object
in that process, accesses this resource.  But simply catching and ignoring the no-node exception
will also fix the problem w/o those assumptions.  Synchronization is great when the resource
being protected is private process memory; however, that's not true in this case.  ZooKeeper
is a cluster-wide resource, and it's possible that any other process in the cluster could mutate
ZooKeeper at any time.  The way I see it, there are at least three options to solve this problem.

 # use Java synchronization with the assumptions stated above
 # use ZooKeeper primitives for dealing with concurrency
 # use Java synchronization and ZooKeeper primitives for dealing with concurrency

I am in favor of #2.  It's also very simple in this case: just ignore NoNodeException,
because it indicates the node was deleted after the call to getChildren() was made.
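
To make the idea concrete, here is a minimal sketch of the "ignore NoNodeException" approach.
It is not the actual DeadServerList code or the attached patch.  So that it compiles without
ZooKeeper on the classpath, it uses hypothetical stand-ins: a local NoNodeException class in
place of org.apache.zookeeper.KeeperException.NoNodeException, and a small Reader interface
in place of ZooReader.  The class name DeadServerSnapshot and the "name=data" result format
are also assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; every type here is a hypothetical stand-in.
class DeadServerSnapshot {

  /** Stand-in for org.apache.zookeeper.KeeperException.NoNodeException. */
  static class NoNodeException extends Exception {
    NoNodeException(String path) {
      super("KeeperErrorCode = NoNode for " + path);
    }
  }

  /** Hypothetical minimal reader; NOT the real ZooReader API. */
  interface Reader {
    // Simplified: the real getChildren() can throw, this stand-in does not.
    List<String> getChildren(String path);

    byte[] getData(String path) throws NoNodeException;
  }

  /** Returns "child=data" for every child that still exists when it is read. */
  static List<String> getList(Reader zoo, String path) {
    List<String> result = new ArrayList<>();
    for (String child : zoo.getChildren(path)) {
      try {
        byte[] data = zoo.getData(path + "/" + child);
        result.add(child + "=" + new String(data, StandardCharsets.UTF_8));
      } catch (NoNodeException e) {
        // Some process deleted the node after getChildren() returned.
        // It is no longer part of the list, so skip it and keep going.
      }
    }
    return result;
  }
}
```

No lock is taken here, so the method behaves the same whether the concurrent delete comes
from another thread in this process or from any other process in the cluster.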
 
bq.  I've made post() also synchronized so that getList() doesn't miss any dead servers that
get added after zoo.getChildren() is called.

It will still miss those servers.  When synchronized, if one thread is in getList() then another
thread calling post() will block until getList() returns.  Detecting changes after getChildren()
is called is not needed; we just need a consistent snapshot at a point in time.  It could be
achieved by checking getChildren() in a loop and waiting for it to stabilize.  But the data could
still be outdated by other operations immediately after getList() returns, so the code still has
to treat it as a snapshot.  Anything more would require some sort of transaction semantics across
method calls, which is not needed.
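
For illustration only (this is the loop described above as not needed, not a proposed fix),
re-reading the children until two consecutive reads agree might look like the sketch below.
The class name StableChildren, the Supplier standing in for a zoo.getChildren(path) call, and
the maxAttempts bound are all assumptions, not part of Accumulo.

```java
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;
import java.util.function.Supplier;

// Illustrative sketch only; names and the retry bound are hypothetical.
class StableChildren {

  /**
   * Re-reads the child list until two consecutive reads contain the same
   * names, or until maxAttempts reads have been made. Reads are compared as
   * sets because the order of the returned children is not relied upon.
   */
  static SortedSet<String> readStable(Supplier<List<String>> getChildren, int maxAttempts) {
    SortedSet<String> previous = new TreeSet<>(getChildren.get());
    for (int attempt = 1; attempt < maxAttempts; attempt++) {
      SortedSet<String> current = new TreeSet<>(getChildren.get());
      if (current.equals(previous)) {
        return current; // two reads in a row agree
      }
      previous = current;
    }
    return previous; // never stabilized; return the latest read
  }
}
```

Even when the loop succeeds, the result can be stale as soon as it is returned, so callers
still have to treat it as a point-in-time snapshot.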
 

> NoNodeException error in master
> -------------------------------
>
>                 Key: ACCUMULO-2154
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-2154
>             Project: Accumulo
>          Issue Type: Bug
>          Components: master
>         Environment: 1.6.0 sha 417902e218c566333b6ea5ac492186ae305e5e16
>            Reporter: John Vines
>            Assignee: Vikram Srivastava
>              Labels: PatchAvailable
>             Fix For: 1.6.0
>
>         Attachments: ACCUMULO-2154.v1.patch.txt
>
>
> I have a test that brings accumulo down hard after a minute and then brings it back up again. I was running it overnight and I saw this stack trace once. Not sure if it's a problem or not though.
> {code}org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /accumulo/617ee3a7-98b9-4f5f-af13-8894afe7c33c/dead/tservers/10.10.1.148:9997
> 	org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /accumulo/617ee3a7-98b9-4f5f-af13-8894afe7c33c/dead/tservers/10.10.1.148:9997
> 		at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
> 		at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> 		at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1151)
> 		at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1180)
> 		at org.apache.accumulo.fate.zookeeper.ZooReader.getData(ZooReader.java:45)
> 		at org.apache.accumulo.server.master.state.DeadServerList.getList(DeadServerList.java:52)
> 		at org.apache.accumulo.master.MasterClientServiceHandler.getMasterStats(MasterClientServiceHandler.java:268)
> 		at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 		at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 		at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 		at java.lang.reflect.Method.invoke(Method.java:597)
> 		at org.apache.accumulo.trace.instrument.thrift.TraceWrap$1.invoke(TraceWrap.java:63)
> 		at com.sun.proxy.$Proxy11.getMasterStats(Unknown Source)
> 		at org.apache.accumulo.core.master.thrift.MasterClientService$Processor$getMasterStats.getResult(MasterClientService.java:1414)
> 		at org.apache.accumulo.core.master.thrift.MasterClientService$Processor$getMasterStats.getResult(MasterClientService.java:1398)
> 		at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
> 		at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
> 		at org.apache.accumulo.server.util.TServerUtils$TimedProcessor.process(TServerUtils.java:171)
> 		at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:206)
> 		at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
> 		at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
> 		at java.lang.Thread.run(Thread.java:662){code}



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
