hadoop-zookeeper-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: How to handle "Node does not exist" error?
Date Thu, 12 Aug 2010 19:18:52 GMT
It doesn't.

But running an incorrectly configured ZK cluster can cause this problem,
and configuring ZK through setters ties your code to whatever configuration
the current version happens to need, which is likely to change.  Thus, your
style of code is more prone to decay over time than is nice.

The rest of my comments detail *other* reasons why embedding a coordination
layer in the code being coordinated is a bad idea.

On Thu, Aug 12, 2010 at 6:33 AM, Vishal K <vishalmlst@gmail.com> wrote:

> Hi Ted,
>
> Can you explain why running ZK in embedded mode can cause znode
> inconsistencies?
> Thanks.
>
> -Vishal
>
> On Thu, Aug 12, 2010 at 12:01 AM, Ted Dunning <ted.dunning@gmail.com> wrote:
>
> > Try running the server in non-embedded mode.
> >
> > Also, you are assuming that you know everything about how to configure
> > the quorumPeer.  That is going to change and your code will break at that
> > time.  If you use a non-embedded cluster, this won't be a problem and you
> > will be able to upgrade ZK versions without having to restart your service.
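A minimal sketch of the non-embedded alternative, assuming the three hosts from the cluster XML quoted later in this thread each run a standalone ZooKeeper server on client port 2181. The application then needs only a connection string listing the ensemble members, which it would pass to the `org.apache.zookeeper.ZooKeeper(String connectString, int sessionTimeout, Watcher watcher)` constructor. The class name `EnsembleConnectString` is hypothetical, and the parsing assumes IPv4 `host:peerPort:electionPort` entries.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: builds a client connection string for an external ensemble.
class EnsembleConnectString {

    // Each member entry is assumed to look like "ip:peerPort:electionPort" (IPv4 only).
    static String fromMembers(List<String> memberHosts, int clientPort) {
        List<String> parts = new ArrayList<>();
        for (String member : memberHosts) {
            // Clients never use the peer (2888) or election (3888) ports,
            // so keep only the address and attach the client port.
            String address = member.split(":")[0];
            parts.add(address + ":" + clientPort);
        }
        return String.join(",", parts);
    }

    public static void main(String[] args) {
        List<String> members = List.of(
                "192.168.2.6:2888:3888",
                "192.168.2.3:2888:3888",
                "192.168.2.4:2888:3888");
        // The string an application would hand to the ZooKeeper client constructor.
        System.out.println(fromMembers(members, 2181));
        // prints 192.168.2.6:2181,192.168.2.3:2181,192.168.2.4:2181
    }
}
```

Because the application holds only this string, the ZooKeeper servers can typically be restarted or upgraded one at a time without touching the application's own process.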
> >
> > My own opinion is that running an embedded ZK is a serious architectural
> > error.  Since I don't know your particular situation, it might be
> > different, but there is an inherent contradiction involved in running a
> > coordination layer as part of the thing being coordinated.  Whatever your
> > software does, it isn't what ZK does.  As such, it is better to factor out
> > the ZK functionality and make it completely stable.  That gives you a much
> > simpler world and will make it easier for you to troubleshoot your system.
> > The simple fact that you can't take down your service without affecting
> > the reliability of your ZK layer makes this a very bad idea.
> >
> > The problems you are having now are only a preview of what this
> > architectural error leads to.  There will be more problems and many of
> > them are likely to be more subtle and lead to service interruptions and
> > lots of wasted time.
> >
> > On Wed, Aug 11, 2010 at 8:49 PM, Dr Hao He <he@softtouchit.com> wrote:
> >
> > > hi, Ted and Mahadev,
> > >
> > >
> > > Here are some more details about my setup:
> > >
> > > I run ZooKeeper in embedded mode with the following code:
> > >
> > >     quorumPeer = new QuorumPeer();
> > >     quorumPeer.setClientPort(getClientPort());
> > >     quorumPeer.setTxnFactory(new FileTxnSnapLog(new File(getDataLogDir()),
> > >             new File(getDataDir())));
> > >     quorumPeer.setQuorumPeers(getServers());
> > >     quorumPeer.setElectionType(getElectionAlg());
> > >     quorumPeer.setMyid(getServerId());
> > >     quorumPeer.setTickTime(getTickTime());
> > >     quorumPeer.setInitLimit(getInitLimit());
> > >     quorumPeer.setSyncLimit(getSyncLimit());
> > >     quorumPeer.setQuorumVerifier(getQuorumVerifier());
> > >     quorumPeer.setCnxnFactory(cnxnFactory);
> > >     quorumPeer.start();
> > >
> > >
> > > The configuration values are read from the following XML document for
> > > server 1:
> > >
> > > <cluster tickTime="1000" initLimit="10" syncLimit="5" clientPort="2181"
> > >          serverId="1">
> > >     <member id="1" host="192.168.2.6:2888:3888"/>
> > >     <member id="2" host="192.168.2.3:2888:3888"/>
> > >     <member id="3" host="192.168.2.4:2888:3888"/>
> > > </cluster>
> > >
> > >
> > > The other servers have the same configuration, except that their ids
> > > are changed to 2 and 3.
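One way to move this setup to standalone servers, sketched under assumptions: translate the `<cluster>` document into the standard `zoo.cfg` keys (`tickTime`, `initLimit`, `syncLimit`, `clientPort`, `dataDir`, and `server.N=host:peerPort:electionPort`), write each server's id into a `myid` file in its data directory, and start each server with the stock `bin/zkServer.sh start` script. The XML has no data directory, so `dataDir` is a hypothetical caller-supplied parameter, and `ClusterXmlToZooCfg` is an illustrative name, not part of ZooKeeper.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

// Hypothetical converter from the <cluster> XML in this thread to zoo.cfg text.
class ClusterXmlToZooCfg {

    private static Element parseCluster(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("cannot parse cluster XML", e);
        }
    }

    // dataDir is not in the XML, so the caller supplies it (assumption).
    static String toZooCfg(String xml, String dataDir) {
        Element cluster = parseCluster(xml);
        StringBuilder cfg = new StringBuilder();
        // These four attributes map one-to-one onto zoo.cfg keys of the same name.
        for (String key : new String[] {"tickTime", "initLimit", "syncLimit", "clientPort"}) {
            cfg.append(key).append('=').append(cluster.getAttribute(key)).append('\n');
        }
        cfg.append("dataDir=").append(dataDir).append('\n');
        NodeList members = cluster.getElementsByTagName("member");
        for (int i = 0; i < members.getLength(); i++) {
            Element member = (Element) members.item(i);
            // zoo.cfg form: server.<id>=<host>:<peerPort>:<electionPort>
            cfg.append("server.").append(member.getAttribute("id"))
               .append('=').append(member.getAttribute("host")).append('\n');
        }
        return cfg.toString();
    }

    // Contents of the myid file that goes into this server's dataDir.
    static String myid(String xml) {
        return parseCluster(xml).getAttribute("serverId") + "\n";
    }

    public static void main(String[] args) {
        String xml = "<cluster tickTime=\"1000\" initLimit=\"10\" syncLimit=\"5\""
                + " clientPort=\"2181\" serverId=\"1\">"
                + "<member id=\"1\" host=\"192.168.2.6:2888:3888\"/>"
                + "<member id=\"2\" host=\"192.168.2.3:2888:3888\"/>"
                + "<member id=\"3\" host=\"192.168.2.4:2888:3888\"/>"
                + "</cluster>";
        // "/var/zookeeper" is a placeholder data directory.
        System.out.print(toZooCfg(xml, "/var/zookeeper"));
        System.out.print(myid(xml));
    }
}
```

Keeping the server configuration in a file read by ZooKeeper itself, rather than in application setters, means a ZooKeeper upgrade only has to honor its own config format.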
> > >
> > > The error occurred on server 3 when I batch loaded some messages to
> > > server 1.  However, this error does not always happen.  I am not sure
> > > exactly what triggered this error yet.
> > >
> > > I also ran the "stat" command on one of the "Node does not exist" nodes
> > > and got:
> > >
> > > stat /xpe/queues/3bd7851e79381ef4bfd1a5857b5e34c04e5159e5/msgs/msg0000001583
> > > Exception in thread "main" java.lang.NullPointerException
> > >        at org.apache.zookeeper.ZooKeeperMain.printStat(ZooKeeperMain.java:129)
> > >        at org.apache.zookeeper.ZooKeeperMain.processZKCmd(ZooKeeperMain.java:715)
> > >        at org.apache.zookeeper.ZooKeeperMain.processCmd(ZooKeeperMain.java:579)
> > >        at org.apache.zookeeper.ZooKeeperMain.executeLine(ZooKeeperMain.java:351)
> > >        at org.apache.zookeeper.ZooKeeperMain.run(ZooKeeperMain.java:309)
> > >        at org.apache.zookeeper.ZooKeeperMain.main(ZooKeeperMain.java:268)
> > > [xpe@t43 zookeeper-3.2.2]$ bin/zkCli.sh
> > >
> > >
> > > Those message nodes are created as CreateMode.PERSISTENT_SEQUENTIAL and
> > > are deleted by the last server that has read them.
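Sequential znodes get a 10-digit, zero-padded counter appended to the requested name (so a request for `msg` yields a name like `msg0000001583`), and `ls`/`getChildren` returns children in no particular order, as the listing further down in this thread shows. A small sketch, assuming a consumer wants to process such a queue in creation order; `SequentialChildren` and its methods are hypothetical helpers, not ZooKeeper API.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helpers for znodes created with a SEQUENTIAL CreateMode.
class SequentialChildren {

    // ZooKeeper appends a fixed-width, zero-padded 10-digit counter.
    static final int SEQ_DIGITS = 10;

    // Name the server would assign for a given prefix and counter value.
    static String nameFor(String prefix, int counter) {
        return prefix + String.format("%010d", counter);
    }

    // Extracts the counter from a child name such as "msg0000001583".
    static int sequenceOf(String childName) {
        if (childName.length() < SEQ_DIGITS) {
            throw new IllegalArgumentException("not a sequential name: " + childName);
        }
        return Integer.parseInt(childName.substring(childName.length() - SEQ_DIGITS));
    }

    // getChildren()/ls return names unordered; sort by counter to get creation order.
    static List<String> inCreationOrder(List<String> children) {
        List<String> sorted = new ArrayList<>(children);
        sorted.sort(Comparator.comparingInt(SequentialChildren::sequenceOf));
        return sorted;
    }

    public static void main(String[] args) {
        List<String> sample = List.of("msg0000002807", "msg0000002700", "msg0000001849");
        System.out.println(inCreationOrder(sample));
        // prints [msg0000001849, msg0000002700, msg0000002807]
    }
}
```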
> > >
> > > If I remove the troubled server's zookeeper log directory and restart
> > > the server, then everything is ok.
> > >
> > > I will try to get the nc result next time I see this problem.
> > >
> > >
> > > Dr Hao He
> > >
> > > XPE - the truly SOA platform
> > >
> > > he@softtouchit.com
> > > http://softtouchit.com
> > > http://itunes.com/apps/Scanmobile
> > >
> > > On 12/08/2010, at 12:32 AM, Mahadev Konar wrote:
> > >
> > > > Hi Dr Hao,
> > > >  Can you please post the configuration of all 3 ZooKeeper servers?  I
> > > > suspect the servers might be misconfigured and might not belong to
> > > > the same ensemble.
> > > >
> > > > Just to be clear:
> > > >
> > > > /xpe/queues/3bd7851e79381ef4bfd1a5857b5e34c04e5159e5/msgs/msg0000002807
> > > >
> > > > and other such nodes exist on one of the ZooKeeper servers, and the
> > > > same nodes do not exist on the other servers?
> > > >
> > > > Also, as Ted pointed out, can you please post the output of
> > > > echo "stat" | nc localhost 2181 (on all 3 servers) to the list?
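For reference, `echo "stat" | nc localhost 2181` just writes the four bytes `stat` to the client port and prints whatever the server sends back before it closes the connection. A minimal Java equivalent using only `java.net.Socket`, assuming a reachable server and a caller-chosen timeout; `FourLetterWord.send` is a hypothetical name, and the host list in `main` is taken from the cluster XML in this thread.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: roughly what `echo "stat" | nc host port` does.
class FourLetterWord {

    static String send(String host, int port, String command, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            socket.setSoTimeout(timeoutMs); // bound each read as well
            OutputStream out = socket.getOutputStream();
            out.write(command.getBytes(StandardCharsets.US_ASCII));
            out.flush();
            // The server writes its report and then closes the connection,
            // so read until end of stream.
            InputStream in = socket.getInputStream();
            ByteArrayOutputStream reply = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                reply.write(buf, 0, n);
            }
            return new String(reply.toByteArray(), StandardCharsets.US_ASCII);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Placeholder hosts: the three servers from the cluster XML in this thread.
        for (String host : new String[] {"192.168.2.6", "192.168.2.3", "192.168.2.4"}) {
            System.out.println("== " + host);
            System.out.println(send(host, 2181, "stat", 5000));
        }
    }
}
```

Collecting the three replies in one run makes it easier to compare them side by side, which is what the request above is after.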
> > > >
> > > > Thanks
> > > > mahadev
> > > >
> > > >
> > > >
> > > > On 8/11/10 12:10 AM, "Dr Hao He" <he@softtouchit.com> wrote:
> > > >
> > > >> hi, Ted,
> > > >>
> > > >> Thanks for the reply.  Here is what I did:
> > > >>
> > > >> [zk: localhost:2181(CONNECTED) 0] ls /xpe/queues/3bd7851e79381ef4bfd1a5857b5e34c04e5159e5/msgs/msg0000002948
> > > >> []
> > > >> [zk: localhost:2181(CONNECTED) 1] ls /xpe/queues/3bd7851e79381ef4bfd1a5857b5e34c04e5159e5/msgs
> > > >> [msg0000002807, msg0000002700, msg0000002701, msg0000002804, msg0000002704,
> > > >> msg0000002706, msg0000002601, msg0000001849, msg0000001847, msg0000002508,
> > > >> msg0000002609, msg0000001841, msg0000002607, msg0000002606, msg0000002604,
> > > >> msg0000002809, msg0000002817, msg0000001633, msg0000002812, msg0000002814,
> > > >> msg0000002711, msg0000002815, msg0000002713, msg0000002716, msg0000001772,
> > > >> msg0000002811, msg0000001635, msg0000001774, msg0000002515, msg0000002610,
> > > >> msg0000001838, msg0000002517, msg0000002612, msg0000002519, msg0000001973,
> > > >> msg0000001835, msg0000001974, msg0000002619, msg0000001831, msg0000002510,
> > > >> msg0000002512, msg0000002615, msg0000002614, msg0000002617, msg0000002104,
> > > >> msg0000002106, msg0000001769, msg0000001768, msg0000002828, msg0000002822,
> > > >> msg0000001760, msg0000002820, msg0000001963, msg0000001961, msg0000002110,
> > > >> msg0000002118, msg0000002900, msg0000002836, msg0000001757, msg0000002907,
> > > >> msg0000001753, msg0000001752, msg0000001755, msg0000001952, msg0000001958,
> > > >> msg0000001852, msg0000001956, msg0000001854, msg0000002749, msg0000001608,
> > > >> msg0000001609, msg0000002747, msg0000002882, msg0000001743, msg0000002888,
> > > >> msg0000001605, msg0000002885, msg0000001487, msg0000001746, msg0000002330,
> > > >> msg0000001749, msg0000001488, msg0000001489, msg0000001881, msg0000001491,
> > > >> msg0000002890, msg0000001889, msg0000002758, msg0000002241, msg0000002892,
> > > >> msg0000002852, msg0000002759, msg0000002898, msg0000002850, msg0000001733,
> > > >> msg0000002751, msg0000001739, msg0000002753, msg0000002756, msg0000002332,
> > > >> msg0000001872, msg0000002233, msg0000001721, msg0000001627, msg0000001720,
> > > >> msg0000001625, msg0000001628, msg0000001629, msg0000001729, msg0000002350,
> > > >> msg0000001727, msg0000002352, msg0000001622, msg0000001726, msg0000001623,
> > > >> msg0000001723, msg0000001724, msg0000001621, msg0000002736, msg0000002738,
> > > >> msg0000002363, msg0000001717, msg0000002878, msg0000002362, msg0000002361,
> > > >> msg0000001611, msg0000001894, msg0000002357, msg0000002218, msg0000002358,
> > > >> msg0000002355, msg0000001895, msg0000002356, msg0000001898, msg0000002354,
> > > >> msg0000001996, msg0000001990, msg0000002093, msg0000002880, msg0000002576,
> > > >> msg0000002579, msg0000002267, msg0000002266, msg0000002366, msg0000001901,
> > > >> msg0000002365, msg0000001903, msg0000001799, msg0000001906, msg0000002368,
> > > >> msg0000001597, msg0000002679, msg0000002166, msg0000001595, msg0000002481,
> > > >> msg0000002482, msg0000002373, msg0000002374, msg0000002371, msg0000001599,
> > > >> msg0000002773, msg0000002274, msg0000002275, msg0000002270, msg0000002583,
> > > >> msg0000002271, msg0000002580, msg0000002067, msg0000002277, msg0000002278,
> > > >> msg0000002376, msg0000002180, msg0000002467, msg0000002378, msg0000002182,
> > > >> msg0000002377, msg0000002184, msg0000002379, msg0000002187, msg0000002186,
> > > >> msg0000002665, msg0000002666, msg0000002381, msg0000002382, msg0000002661,
> > > >> msg0000002662, msg0000002663, msg0000002385, msg0000002284, msg0000002766,
> > > >> msg0000002282, msg0000002190, msg0000002599, msg0000002054, msg0000002596,
> > > >> msg0000002453, msg0000002459, msg0000002457, msg0000002456, msg0000002191,
> > > >> msg0000002652, msg0000002395, msg0000002650, msg0000002656, msg0000002655,
> > > >> msg0000002189, msg0000002047, msg0000002658, msg0000002659, msg0000002796,
> > > >> msg0000002250, msg0000002255, msg0000002589, msg0000002257, msg0000002061,
> > > >> msg0000002064, msg0000002585, msg0000002258, msg0000002587, msg0000002444,
> > > >> msg0000002446, msg0000002447, msg0000002450, msg0000002646, msg0000001501,
> > > >> msg0000002591, msg0000002592, msg0000001503, msg0000001506, msg0000002260,
> > > >> msg0000002594, msg0000002262, msg0000002263, msg0000002264, msg0000002590,
> > > >> msg0000002132, msg0000002130, msg0000002530, msg0000002931, msg0000001559,
> > > >> msg0000001808, msg0000002024, msg0000001553, msg0000002939, msg0000002937,
> > > >> msg0000001556, msg0000002935, msg0000002933, msg0000002140, msg0000001937,
> > > >> msg0000002143, msg0000002520, msg0000002522, msg0000002429, msg0000002524,
> > > >> msg0000002920, msg0000002035, msg0000001561, msg0000002134, msg0000002138,
> > > >> msg0000002925, msg0000002151, msg0000002287, msg0000002555, msg0000002010,
> > > >> msg0000002002, msg0000002290, msg0000001537, msg0000002005, msg0000002147,
> > > >> msg0000002145, msg0000002698, msg0000001592, msg0000001810, msg0000002690,
> > > >> msg0000002691, msg0000001911, msg0000001910, msg0000002693, msg0000001812,
> > > >> msg0000001817, msg0000001547, msg0000002012, msg0000002015, msg0000002941,
> > > >> msg0000001688, msg0000002018, msg0000002684, msg0000002944, msg0000001540,
> > > >> msg0000002686, msg0000001541, msg0000002946, msg0000002688, msg0000001584,
> > > >> msg0000002948]
> > > >>
> > > >> [zk: localhost:2181(CONNECTED) 7] delete /xpe/queues/3bd7851e79381ef4bfd1a5857b5e34c04e5159e5/msgs/msg0000002948
> > > >> Node does not exist: /xpe/queues/3bd7851e79381ef4bfd1a5857b5e34c04e5159e5/msgs/msg0000002948
> > > >>
> > > >> When I performed the same operations on another server, none of
> > > >> those nodes existed.
> > > >>
> > > >>
> > > >> Dr Hao He
> > > >>
> > > >> On 11/08/2010, at 4:38 PM, Ted Dunning wrote:
> > > >>
> > > >>> Can you provide some more information?  The output of some of the
> > > >>> four-letter commands and a transcript of what you are doing would be
> > > >>> very helpful.
> > > >>>
> > > >>> Also, there is no way for znodes to exist on one server of a properly
> > > >>> operating ZK cluster and not on either of the other two.  Something
> > > >>> has to be wrong, and I would vote for operator error (not to cast
> > > >>> aspersions, it is just that humans like you and *me* make more errors
> > > >>> than ZK does).
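To check this point mechanically, a sketch: collect the children of the same parent znode from each server (for example by running `ls` in `zkCli.sh -server host:2181` against each host in turn) and report the names missing from at least one of them. `ReplicaDiff` is a hypothetical helper and the server labels are arbitrary. Because a follower may briefly lag the leader, only a difference that persists across repeated checks is meaningful.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical diagnostic: which children are not reported by every server?
class ReplicaDiff {

    // childrenByServer: server label -> child names returned by `ls` on that server.
    // Result: child name -> servers that DO report it, for children missing somewhere.
    static Map<String, Set<String>> inconsistentChildren(Map<String, Set<String>> childrenByServer) {
        Map<String, Set<String>> seenOn = new TreeMap<>();
        for (Map.Entry<String, Set<String>> entry : childrenByServer.entrySet()) {
            for (String child : entry.getValue()) {
                seenOn.computeIfAbsent(child, k -> new TreeSet<>()).add(entry.getKey());
            }
        }
        int serverCount = childrenByServer.size();
        // A child seen on every server is consistent; drop it.
        seenOn.values().removeIf(servers -> servers.size() == serverCount);
        return seenOn;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> views = Map.of(
                "server1", Set.of("msg0000002807"),
                "server2", Set.of("msg0000002807"),
                "server3", Set.of("msg0000002807", "msg0000002948"));
        System.out.println(inconsistentChildren(views));
        // prints {msg0000002948=[server3]}
    }
}
```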
> > > >>>
> > > >>> On Tue, Aug 10, 2010 at 11:32 PM, Dr Hao He <he@softtouchit.com> wrote:
> > > >>>
> > > >>>> hi, All,
> > > >>>>
> > > >>>> I have a 3-host cluster running ZooKeeper 3.2.2.  On one of the
> > > >>>> hosts, there are a number of nodes that I can "get" and "ls" using
> > > >>>> zkCli.sh.  However, when I tried to "delete" any of them, I got a
> > > >>>> "Node does not exist" error.  Those nodes do not exist on the other
> > > >>>> two hosts.
> > > >>>>
> > > >>>> Any idea how we should handle this type of error and what might
> > > >>>> have caused this problem?
> > > >>>>
> > > >>>> Dr Hao He
> > > >>>>
> > > >>>>
> > > >>
> > > >>
> > > >
> > > >
> > >
> > >
> >
>
