hadoop-zookeeper-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: How to handle "Node does not exist" error?
Date Thu, 12 Aug 2010 04:01:31 GMT
Try running the server in non-embedded mode.

Also, you are assuming that you know everything about how to configure the
quorumPeer.  Those internals are going to change, and your code will break
when they do.  If you use a non-embedded cluster, this won't be a problem, and
you will be able to upgrade the ZK version without having to restart your
service.
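In non-embedded mode the application stops configuring a QuorumPeer and only connects as a client. Here is a minimal sketch of deriving the client connect string, assuming the three member hosts and client port 2181 from the XML configuration posted further down the thread. `ConnectStringBuilder` is a hypothetical helper. The `ZooKeeper` client constructor appears only in a comment because the sketch uses nothing beyond the JDK.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: with an external ensemble the application needs only a
// client connect string; tickTime, initLimit, etc. stay in the servers' own
// config files.
class ConnectStringBuilder {

    // Each member uses the "host:peerPort:electionPort" form from the XML
    // config; only the host part is kept and the client port is appended.
    // Assumes plain hostnames or IPv4 addresses (no IPv6 literals).
    static String connectString(List<String> members, int clientPort) {
        List<String> parts = new ArrayList<>();
        for (String member : members) {
            String host = member.split(":")[0];
            parts.add(host + ":" + clientPort);
        }
        return String.join(",", parts);
    }

    public static void main(String[] args) {
        String cs = connectString(List.of(
                "192.168.2.6:2888:3888",
                "192.168.2.3:2888:3888",
                "192.168.2.4:2888:3888"), 2181);
        // With the ZooKeeper client jar on the classpath, this string would be
        // the first argument of new ZooKeeper(connectString, sessionTimeout, watcher).
        System.out.println(cs);
    }
}
```

The servers themselves would then run as separate processes, so a ZK upgrade or restart no longer requires touching this code.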

My own opinion is that running an embedded ZK is a serious architectural
error.  Since I don't know your particular situation, it might be different,
but there is an inherent contradiction involved in running a coordination
layer as part of the thing being coordinated.  Whatever your software does,
it isn't what ZK does.  As such, it is better to factor out the ZK
functionality and make it completely stable.  That gives you a much simpler
world and will make it easier for you to troubleshoot your system.  The
simple fact that you can't take down your service without affecting the
reliability of your ZK layer makes this a very bad idea.

The problems you are having now are only a preview of what this
architectural error leads to.  There will be more problems and many of them
are likely to be more subtle and lead to service interruptions and lots of
wasted time.

On Wed, Aug 11, 2010 at 8:49 PM, Dr Hao He <he@softtouchit.com> wrote:

> hi, Ted and Mahadev,
>
>
> Here are some more details about my setup:
>
> I run zookeeper in the embedded mode with the following code:
>
>     quorumPeer = new QuorumPeer();
>     quorumPeer.setClientPort(getClientPort());
>     quorumPeer.setTxnFactory(new FileTxnSnapLog(
>             new File(getDataLogDir()), new File(getDataDir())));
>     quorumPeer.setQuorumPeers(getServers());
>     quorumPeer.setElectionType(getElectionAlg());
>     quorumPeer.setMyid(getServerId());
>     quorumPeer.setTickTime(getTickTime());
>     quorumPeer.setInitLimit(getInitLimit());
>     quorumPeer.setSyncLimit(getSyncLimit());
>     quorumPeer.setQuorumVerifier(getQuorumVerifier());
>     quorumPeer.setCnxnFactory(cnxnFactory);
>     quorumPeer.start();
>
>
> The configuration values are read from the following XML document for
> server 1:
>
> <cluster tickTime="1000" initLimit="10" syncLimit="5" clientPort="2181"
> serverId="1">
>                  <member id="1" host="192.168.2.6:2888:3888"/>
>                  <member id="2" host="192.168.2.3:2888:3888"/>
>                  <member id="3" host="192.168.2.4:2888:3888"/>
> </cluster>
>
>
> The other servers have the same configuration, except that their ids are
> changed to 2 and 3.
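If the servers were moved out of the application, as Ted recommends, the same XML values would map onto a conventional `zoo.cfg`. Here is a minimal sketch using only the JDK XML parser. `ZooCfgFromXml` is a hypothetical helper, and `/var/lib/zookeeper` is an assumed `dataDir`, since the thread does not show one. The per-server id does not belong in `zoo.cfg`; it goes into a file named `myid` inside `dataDir`. A separate `dataLogDir` key also exists if the transaction logs should stay on their own directory, as the embedded code does with `getDataLogDir()`.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper: translate the <cluster> XML into zoo.cfg text.
// dataDir is an assumed path; the thread does not show one.
class ZooCfgFromXml {

    static Element parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                .getDocumentElement();
    }

    static String toZooCfg(String xml, String dataDir) throws Exception {
        Element cluster = parse(xml);
        StringBuilder cfg = new StringBuilder();
        // Timing and port settings map one-to-one onto zoo.cfg keys.
        for (String key : new String[] {"tickTime", "initLimit", "syncLimit", "clientPort"}) {
            cfg.append(key).append('=').append(cluster.getAttribute(key)).append('\n');
        }
        cfg.append("dataDir=").append(dataDir).append('\n');
        // Each member becomes server.<id>=<host>:<peerPort>:<electionPort>.
        NodeList members = cluster.getElementsByTagName("member");
        for (int i = 0; i < members.getLength(); i++) {
            Element m = (Element) members.item(i);
            cfg.append("server.").append(m.getAttribute("id"))
               .append('=').append(m.getAttribute("host")).append('\n');
        }
        return cfg.toString();
    }

    // serverId is not a zoo.cfg key: it is written to a file named "myid" in dataDir.
    static String myid(String xml) throws Exception {
        return parse(xml).getAttribute("serverId");
    }

    public static void main(String[] args) throws Exception {
        String xml = "<cluster tickTime=\"1000\" initLimit=\"10\" syncLimit=\"5\""
                + " clientPort=\"2181\" serverId=\"1\">"
                + "<member id=\"1\" host=\"192.168.2.6:2888:3888\"/>"
                + "<member id=\"2\" host=\"192.168.2.3:2888:3888\"/>"
                + "<member id=\"3\" host=\"192.168.2.4:2888:3888\"/>"
                + "</cluster>";
        System.out.print(toZooCfg(xml, "/var/lib/zookeeper"));
        System.out.println("myid: " + myid(xml));
    }
}
```

For server 1 this prints the four timing and port lines, the `dataDir` line, `server.1` through `server.3`, and finally `myid: 1`.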
>
> The error occurred on server 3 when I batch loaded some messages to server
> 1.  However, this error does not always happen.  I am not sure exactly what
> triggered this error yet.
>
> I also performed the "stat" operation on one of the "Node does not exist"
> nodes and got:
>
> stat /xpe/queues/3bd7851e79381ef4bfd1a5857b5e34c04e5159e5/msgs/msg0000001583
> Exception in thread "main" java.lang.NullPointerException
>        at org.apache.zookeeper.ZooKeeperMain.printStat(ZooKeeperMain.java:129)
>        at org.apache.zookeeper.ZooKeeperMain.processZKCmd(ZooKeeperMain.java:715)
>        at org.apache.zookeeper.ZooKeeperMain.processCmd(ZooKeeperMain.java:579)
>        at org.apache.zookeeper.ZooKeeperMain.executeLine(ZooKeeperMain.java:351)
>        at org.apache.zookeeper.ZooKeeperMain.run(ZooKeeperMain.java:309)
>        at org.apache.zookeeper.ZooKeeperMain.main(ZooKeeperMain.java:268)
> [xpe@t43 zookeeper-3.2.2]$ bin/zkCli.sh
>
>
> Those message nodes are created as CreateMode.PERSISTENT_SEQUENTIAL and are
> deleted by the last server that has read them.
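For reference, PERSISTENT_SEQUENTIAL names such as `msg0000002948` consist of the requested prefix followed by a 10-digit, zero-padded counter maintained by the parent znode. Here is a minimal sketch using only the JDK. `SequentialNames` is a hypothetical helper. `getChildren()`, and therefore `ls`, returns children in no particular order, as the listing in this thread shows. A consumer that wants creation order therefore has to sort by the suffix itself.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper: how sequential names like "msg0000002948" are formed,
// and how to put an unordered child list back into creation order.
class SequentialNames {

    // ZooKeeper appends a 10-digit, zero-padded counter to the requested name.
    static String sequentialName(String prefix, int counter) {
        return String.format("%s%010d", prefix, counter);
    }

    // Recover the counter from the last 10 characters of a child name.
    static long sequenceOf(String child) {
        return Long.parseLong(child.substring(child.length() - 10));
    }

    // getChildren() makes no ordering guarantee, so sort explicitly.
    static List<String> inCreationOrder(Collection<String> children) {
        List<String> sorted = new ArrayList<>(children);
        sorted.sort(Comparator.comparingLong(SequentialNames::sequenceOf));
        return sorted;
    }

    public static void main(String[] args) {
        System.out.println(sequentialName("msg", 2948));
        System.out.println(inCreationOrder(
                List.of("msg0000002807", "msg0000001487", "msg0000002948")));
    }
}
```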
>
> If I remove the troubled server's zookeeper log directory and restart the
> server, then everything is ok.
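The workaround described here, removing the troubled server's log directory and restarting it, can be made slightly more conservative. Moving the directory aside keeps the transaction logs available for diagnosing how that server diverged. Here is a minimal sketch. `LogDirQuarantine` is a hypothetical helper, and the server must be stopped first. As in the poster's case, the restarted server is assumed to resynchronize from the rest of the ensemble.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: rename a stale server's transaction-log directory
// instead of deleting it. Run only while that ZooKeeper server is stopped.
class LogDirQuarantine {

    static Path quarantine(Path logDir, String suffix) throws IOException {
        if (!Files.isDirectory(logDir)) {
            throw new IOException("not a directory: " + logDir);
        }
        Path target = logDir.resolveSibling(logDir.getFileName() + "." + suffix);
        // A rename within the same parent directory; fails if target already exists.
        return Files.move(logDir, target);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: LogDirQuarantine <dataLogDir>");
            return;
        }
        Path moved = quarantine(Paths.get(args[0]), "bad-" + System.currentTimeMillis());
        System.out.println("moved to " + moved);
    }
}
```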
>
> I will try to get the nc result next time I see this problem.
>
>
> Dr Hao He
>
> XPE - the truly SOA platform
>
> he@softtouchit.com
> http://softtouchit.com
> http://itunes.com/apps/Scanmobile
>
> On 12/08/2010, at 12:32 AM, Mahadev Konar wrote:
>
> > Hi Dr Hao,
> >  Can you please post the configuration of all 3 ZooKeeper servers? I
> > suspect the cluster might be misconfigured and the servers might not
> > belong to the same ensemble.
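Mahadev's hypothesis can be checked mechanically before posting anything. Every server should list the same members, each `serverId` should be unique, and each server should appear in its own list. Here is a minimal sketch. `EnsembleCheck` and `ServerView` are hypothetical names. The sketch does not verify that each machine's real address matches the entry for its id; that still has to be checked by hand.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: do the per-server configs describe one ensemble?
class EnsembleCheck {

    // One server's view: its own id plus its id -> "host:peer:election" map.
    static final class ServerView {
        final int serverId;
        final Map<Integer, String> members;
        ServerView(int serverId, Map<Integer, String> members) {
            this.serverId = serverId;
            this.members = members;
        }
    }

    static List<String> problems(List<ServerView> views) {
        List<String> problems = new ArrayList<>();
        Set<Integer> seenIds = new HashSet<>();
        Map<Integer, String> reference = views.isEmpty() ? null : views.get(0).members;
        for (ServerView v : views) {
            if (!seenIds.add(v.serverId)) {
                problems.add("serverId " + v.serverId + " is used by more than one server");
            }
            if (!v.members.containsKey(v.serverId)) {
                problems.add("server " + v.serverId + " is missing from its own member list");
            }
            // Every server must list exactly the same members.
            if (!v.members.equals(reference)) {
                problems.add("server " + v.serverId + " has a different member list");
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        Map<Integer, String> members = Map.of(
                1, "192.168.2.6:2888:3888",
                2, "192.168.2.3:2888:3888",
                3, "192.168.2.4:2888:3888");
        List<ServerView> views = List.of(
                new ServerView(1, members), new ServerView(2, members), new ServerView(3, members));
        System.out.println(problems(views)); // an empty list means the views agree
    }
}
```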
> >
> > Just to be clear:
> > /xpe/queues/3bd7851e79381ef4bfd1a5857b5e34c04e5159e5/msgs/msg0000002807
> >
> > And other such nodes exist on one of the ZooKeeper servers, while the
> > same nodes do not exist on the other servers?
> >
> > Also, as Ted pointed out, can you please post the output of
> > echo "stat" | nc localhost 2181 (on all 3 servers) to the list?
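The `stat` request can also be scripted from Java when `nc` is not at hand. Here is a minimal sketch using only the JDK. `FourLetterWord` is a hypothetical helper, and the 5-second timeouts are arbitrary. `field` assumes the reply uses `Key: value` lines such as `Mode: follower`; the exact set of lines varies between ZooKeeper versions. Running it against each of the three servers shows each one's mode (leader, follower, or standalone).

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: a JDK-only stand-in for `echo stat | nc localhost 2181`.
// The server answers a four-letter command and then closes the connection,
// so reading until end-of-stream collects the whole reply.
class FourLetterWord {

    static String send(String host, int port, String command) throws IOException {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), 5000); // 5 s connect timeout
            socket.setSoTimeout(5000);                                // 5 s read timeout
            OutputStream out = socket.getOutputStream();
            out.write(command.getBytes(StandardCharsets.US_ASCII));
            out.flush();
            InputStream in = socket.getInputStream();
            ByteArrayOutputStream reply = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                reply.write(buf, 0, n);
            }
            return new String(reply.toByteArray(), StandardCharsets.US_ASCII);
        }
    }

    // Return the value of a "Key: value" line (e.g. "Mode"), or null if absent.
    static String field(String reply, String key) {
        for (String line : reply.split("\n")) {
            if (line.startsWith(key + ":")) {
                return line.substring(key.length() + 1).trim();
            }
        }
        return null;
    }

    public static void main(String[] args) throws IOException {
        String reply = send("localhost", 2181, "stat");
        System.out.print(reply);
        System.out.println("Mode -> " + field(reply, "Mode"));
    }
}
```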
> >
> > Thanks
> > mahadev
> >
> >
> >
> > On 8/11/10 12:10 AM, "Dr Hao He" <he@softtouchit.com> wrote:
> >
> >> hi, Ted,
> >>
> >> Thanks for the reply.  Here is what I did:
> >>
> >> [zk: localhost:2181(CONNECTED) 0] ls
> >> /xpe/queues/3bd7851e79381ef4bfd1a5857b5e34c04e5159e5/msgs/msg0000002948
> >> []
> >> [zk: localhost:2181(CONNECTED) 1] ls
> >> /xpe/queues/3bd7851e79381ef4bfd1a5857b5e34c04e5159e5/msgs
> >> [msg0000002807, msg0000002700, msg0000002701, msg0000002804, msg0000002704,
> >> msg0000002706, msg0000002601, msg0000001849, msg0000001847, msg0000002508,
> >> msg0000002609, msg0000001841, msg0000002607, msg0000002606, msg0000002604,
> >> msg0000002809, msg0000002817, msg0000001633, msg0000002812, msg0000002814,
> >> msg0000002711, msg0000002815, msg0000002713, msg0000002716, msg0000001772,
> >> msg0000002811, msg0000001635, msg0000001774, msg0000002515, msg0000002610,
> >> msg0000001838, msg0000002517, msg0000002612, msg0000002519, msg0000001973,
> >> msg0000001835, msg0000001974, msg0000002619, msg0000001831, msg0000002510,
> >> msg0000002512, msg0000002615, msg0000002614, msg0000002617, msg0000002104,
> >> msg0000002106, msg0000001769, msg0000001768, msg0000002828, msg0000002822,
> >> msg0000001760, msg0000002820, msg0000001963, msg0000001961, msg0000002110,
> >> msg0000002118, msg0000002900, msg0000002836, msg0000001757, msg0000002907,
> >> msg0000001753, msg0000001752, msg0000001755, msg0000001952, msg0000001958,
> >> msg0000001852, msg0000001956, msg0000001854, msg0000002749, msg0000001608,
> >> msg0000001609, msg0000002747, msg0000002882, msg0000001743, msg0000002888,
> >> msg0000001605, msg0000002885, msg0000001487, msg0000001746, msg0000002330,
> >> msg0000001749, msg0000001488, msg0000001489, msg0000001881, msg0000001491,
> >> msg0000002890, msg0000001889, msg0000002758, msg0000002241, msg0000002892,
> >> msg0000002852, msg0000002759, msg0000002898, msg0000002850, msg0000001733,
> >> msg0000002751, msg0000001739, msg0000002753, msg0000002756, msg0000002332,
> >> msg0000001872, msg0000002233, msg0000001721, msg0000001627, msg0000001720,
> >> msg0000001625, msg0000001628, msg0000001629, msg0000001729, msg0000002350,
> >> msg0000001727, msg0000002352, msg0000001622, msg0000001726, msg0000001623,
> >> msg0000001723, msg0000001724, msg0000001621, msg0000002736, msg0000002738,
> >> msg0000002363, msg0000001717, msg0000002878, msg0000002362, msg0000002361,
> >> msg0000001611, msg0000001894, msg0000002357, msg0000002218, msg0000002358,
> >> msg0000002355, msg0000001895, msg0000002356, msg0000001898, msg0000002354,
> >> msg0000001996, msg0000001990, msg0000002093, msg0000002880, msg0000002576,
> >> msg0000002579, msg0000002267, msg0000002266, msg0000002366, msg0000001901,
> >> msg0000002365, msg0000001903, msg0000001799, msg0000001906, msg0000002368,
> >> msg0000001597, msg0000002679, msg0000002166, msg0000001595, msg0000002481,
> >> msg0000002482, msg0000002373, msg0000002374, msg0000002371, msg0000001599,
> >> msg0000002773, msg0000002274, msg0000002275, msg0000002270, msg0000002583,
> >> msg0000002271, msg0000002580, msg0000002067, msg0000002277, msg0000002278,
> >> msg0000002376, msg0000002180, msg0000002467, msg0000002378, msg0000002182,
> >> msg0000002377, msg0000002184, msg0000002379, msg0000002187, msg0000002186,
> >> msg0000002665, msg0000002666, msg0000002381, msg0000002382, msg0000002661,
> >> msg0000002662, msg0000002663, msg0000002385, msg0000002284, msg0000002766,
> >> msg0000002282, msg0000002190, msg0000002599, msg0000002054, msg0000002596,
> >> msg0000002453, msg0000002459, msg0000002457, msg0000002456, msg0000002191,
> >> msg0000002652, msg0000002395, msg0000002650, msg0000002656, msg0000002655,
> >> msg0000002189, msg0000002047, msg0000002658, msg0000002659, msg0000002796,
> >> msg0000002250, msg0000002255, msg0000002589, msg0000002257, msg0000002061,
> >> msg0000002064, msg0000002585, msg0000002258, msg0000002587, msg0000002444,
> >> msg0000002446, msg0000002447, msg0000002450, msg0000002646, msg0000001501,
> >> msg0000002591, msg0000002592, msg0000001503, msg0000001506, msg0000002260,
> >> msg0000002594, msg0000002262, msg0000002263, msg0000002264, msg0000002590,
> >> msg0000002132, msg0000002130, msg0000002530, msg0000002931, msg0000001559,
> >> msg0000001808, msg0000002024, msg0000001553, msg0000002939, msg0000002937,
> >> msg0000001556, msg0000002935, msg0000002933, msg0000002140, msg0000001937,
> >> msg0000002143, msg0000002520, msg0000002522, msg0000002429, msg0000002524,
> >> msg0000002920, msg0000002035, msg0000001561, msg0000002134, msg0000002138,
> >> msg0000002925, msg0000002151, msg0000002287, msg0000002555, msg0000002010,
> >> msg0000002002, msg0000002290, msg0000001537, msg0000002005, msg0000002147,
> >> msg0000002145, msg0000002698, msg0000001592, msg0000001810, msg0000002690,
> >> msg0000002691, msg0000001911, msg0000001910, msg0000002693, msg0000001812,
> >> msg0000001817, msg0000001547, msg0000002012, msg0000002015, msg0000002941,
> >> msg0000001688, msg0000002018, msg0000002684, msg0000002944, msg0000001540,
> >> msg0000002686, msg0000001541, msg0000002946, msg0000002688, msg0000001584,
> >> msg0000002948]
> >>
> >> [zk: localhost:2181(CONNECTED) 7] delete
> >> /xpe/queues/3bd7851e79381ef4bfd1a5857b5e34c04e5159e5/msgs/msg0000002948
> >> Node does not exist:
> >> /xpe/queues/3bd7851e79381ef4bfd1a5857b5e34c04e5159e5/msgs/msg0000002948
> >>
> >> When I performed the same operations on another server, none of those
> >> nodes existed.
> >>
> >>
> >> Dr Hao He
> >>
> >>
> >> On 11/08/2010, at 4:38 PM, Ted Dunning wrote:
> >>
> >>> Can you provide some more information?  The output of some of the
> >>> four-letter commands and a transcript of what you are doing would be
> >>> very helpful.
> >>>
> >>> Also, there is no way for znodes to exist on one node of a properly
> >>> operating ZK cluster and not on either of the other two.  Something has
> >>> to be wrong, and I would vote for operator error (not to cast
> >>> aspersions; it is just that humans like you and *me* make more errors
> >>> than ZK does).
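Ted's point rests on majority quorums. In a 3-server ensemble, a change is committed only after at least 2 servers have acknowledged it, and any two majorities share a server. A committed create has therefore reached at least two of the three servers. A znode visible on exactly one server points at a server whose state has diverged, not at normal operation. Here is a minimal sketch of that counting argument, assuming simple majorities (ZooKeeper's default quorum rule). It is not ZooKeeper's Zab implementation, and `QuorumMath` is a hypothetical name.

```java
// Hypothetical helper: the majority arithmetic for an ensemble of n voting
// servers. Just the counting argument, not ZooKeeper code.
class QuorumMath {

    // Smallest number of servers that is strictly more than half of n.
    static int majority(int n) {
        return n / 2 + 1;
    }

    // A write counts as committed only once a majority has acknowledged it.
    static boolean committed(int acks, int n) {
        return acks >= majority(n);
    }

    // Brute force over all subsets (bit masks) of n servers: any two
    // majorities must share at least one server.
    static boolean majoritiesIntersect(int n) {
        int q = majority(n);
        for (int a = 0; a < (1 << n); a++) {
            if (Integer.bitCount(a) < q) continue;
            for (int b = 0; b < (1 << n); b++) {
                if (Integer.bitCount(b) >= q && (a & b) == 0) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        int n = 3;
        System.out.println("majority of " + n + " = " + majority(n));           // 2
        System.out.println("1 ack committed? " + committed(1, n));             // false
        System.out.println("majorities intersect? " + majoritiesIntersect(n)); // true
    }
}
```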
> >>>
> >>> On Tue, Aug 10, 2010 at 11:32 PM, Dr Hao He <he@softtouchit.com> wrote:
> >>>
> >>>> hi, All,
> >>>>
> >>>> I have a 3-host cluster running ZooKeeper 3.2.2.  On one of the hosts,
> >>>> there are a number of nodes that I can "get" and "ls" using zkCli.sh.
> >>>> However, when I tried to "delete" any of them, I got a "Node does not
> >>>> exist" error.  Those nodes do not exist on the other two hosts.
> >>>>
> >>>> Any idea how we should handle this type of error and what might have
> >>>> caused this problem?
> >>>>
> >>>> Dr Hao He
> >>>>
> >>>>
> >>>>
> >>
> >>
> >
> >
>
>
