zookeeper-user mailing list archives

From Michael Han <h...@apache.org>
Subject Re: Correlate Ephemeral Owner with connected session
Date Mon, 02 Nov 2020 05:26:05 GMT
>> but I'm not sure if the sid is exposed anywhere in the API (if it is, I
>> haven't found it yet and would appreciate guidance).

The session id can be retrieved through the Stat object passed to various
ZooKeeper APIs (like getData). Once you have a Stat object, calling
getEphemeralOwner returns the id of the session that owns the node.
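
As a rough illustration of the comparison Paul describes, here is a minimal
sketch. The class and method names (EphemeralOwnerCheck, ownedByThisSession,
sessionResumed, toHex) are made up for this example. It assumes the two ids
come from Stat.getEphemeralOwner() and from ZooKeeper.getSessionId() on the
client handle; the client calls appear only in comments so the block compiles
without the ZooKeeper jar.

```java
// Sketch only: hypothetical helper, not part of the ZooKeeper API.
final class EphemeralOwnerCheck {

    // ownerFromStat:    value of Stat.getEphemeralOwner() for the znode.
    // currentSessionId: value of ZooKeeper.getSessionId() on this client.
    static boolean ownedByThisSession(long ownerFromStat, long currentSessionId) {
        // The ephemeral owner is 0 for znodes that are not ephemeral,
        // so 0 never counts as "owned by this session".
        return ownerFromStat != 0L && ownerFromStat == currentSessionId;
    }

    // True when the id recorded before the disconnect matches the id seen
    // after the reconnect, i.e. the original session was resumed.
    static boolean sessionResumed(long idBeforeDisconnect, long idAfterReconnect) {
        return idBeforeDisconnect == idAfterReconnect;
    }

    // Hex form, convenient for comparing against the sid printed by 'cons'
    // or the ephemeralOwner printed by zkCli.sh.
    static String toHex(long sessionId) {
        return "0x" + Long.toHexString(sessionId);
    }

    // With a real client handle (assumed to be named zk) the check might look like:
    //   Stat stat = zk.exists(path, false);   // null if the znode is gone
    //   boolean mine = stat != null
    //       && ownedByThisSession(stat.getEphemeralOwner(), zk.getSessionId());
}
```

Comparing ids rather than checking that the znode exists also covers the case
where another instance deleted and recreated the znode in the meantime, since
the recreated node would carry that instance's session id.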

Alternatively, as Raúl pointed out, zk-shell is an excellent tool to obtain
the same information.

I'd also echo what Enrico pointed out about upgrading - we have fixed quite
a few bugs related to ephemeral nodes, and you could be hitting one of them
in your case.

On Fri, Oct 30, 2020 at 10:46 AM Paul Summermatter <paulrs@me.com.invalid>
wrote:

> Folks,
>
>         When I grep'd the ZK server logs for the session ID, I do see at
> the time that the connection was lost and reset the following message:
>
> "Client attempting to renew session"
>
>         So it looks like this is indeed the issue that the client
> reconnected and kept the same session. I suspect upgrading will not "fix"
> this issue, because it seems this is behaving as designed. I'll need to do
> some more research to understand how I can tell when a reconnect has
> triggered a new session versus resumed the original session. Checking for
> the existence of the znodes after reconnect won't work, because they could
> have been deleted and recreated by another app instance that has picked up
> the work on behalf of the disconnected instance. If I could see the
> client's sid and compare the new connection's sid to the old, I guess I
> could assume I'm still the owner of the znodes if they exist, but I'm not
> sure if the sid is exposed anywhere in the API (if it is, I haven't found
> it yet and would appreciate guidance). It would also be helpful if the
> znode's ephemeral owner ID were exposed in the client API, but I don't see
> that anywhere in the WatchedEvent API. I guess another possibility is that
> I have to append some information onto the znode's path that identifies the
> owner, but that would require a major change in our logic that would
> introduce a lot of additional complexity. Right now, each app will randomly
> try to grab work and register that it is handling the work by creating a
> well known path with the work's unique ID. Successful creation of the path
> means no other app is handling the work.
>
>         If there is an easier way of managing all of this, please let me
> know. The point of using ZooKeeper was to delegate all the messiness of
> managing a distributed system, but if I have to have complicated logic to
> sense disconnects and then check for the existence of ephemeral znodes
> after a reconnect to know whether I'm still the owner of shared work, that
> isn't terribly helpful. Hopefully, I'm missing something obvious that makes
> this much easier.
>
> Paul
>
> > On Oct 30, 2020, at 1:10 PM, Paul Summermatter <paulrs@me.com.INVALID>
> > wrote:
> >
> > Enrico,
> >
> >       Thank you very much for the incredibly rapid reply. I just
> > discovered that I can indeed correlate the ephemeral owner ID with a
> > session's "sid" using the 'cons' command. I discovered that one of the
> > three ZK instances thinks there is a session with that ID.
> >
> >       Do you or anyone else happen to know if ZK has any issues (either
> > in the current or older versions) where a session will not be terminated if
> > the client reconnects within a relatively short period of time? I don't
> > know how exactly ZK identifies a session or whether the ZK client is trying
> > to be helpful and attempts to maintain the session when it reconnects by
> > providing the prior session ID in the new connection request, preventing
> > the ephemeral nodes from being deleted as I want/expect.
> >
> >       Given our lengthy testing cycle and the fact that we're nearing
> > the holidays, upgrading ZK won't be possible until next year, but we will
> > definitely look into it. My only concern is if this is ZK's expected
> > behavior for some reason, upgrading won't solve the issue.
> >
> > Regards,
> > Paul
> >
> >> On Oct 30, 2020, at 12:34 PM, Enrico Olivelli <eolivelli@gmail.com>
> >> wrote:
> >>
> >> Paul,
> >> do you have a way to upgrade to the latest ZK 3.6.2 ?
> >> many things changed since 3.4.6, it is a pretty old version
> >>
> >> The session is declared as "expired" on the server side, and this will in
> >> turn trigger the deletion of the ephemeral nodes. If they aren't deleted,
> >> the session is still active from the server's point of view, or there is
> >> some kind of bug.
> >>
> >> Enrico
> >>
> >>
> >> On Fri, Oct 30, 2020 at 17:30, Paul Summermatter
> >> <paulrs@me.com.invalid> wrote:
> >>
> >>> RE: ZooKeeper 3.4.6
> >>>
> >>> All,
> >>>
> >>>       I'm trying to troubleshoot a problem and could use some guidance
> >>> from the experts on ZK administration. I have a cluster of applications
> >>> that share work and that create ephemeral nodes representing the work in ZK
> >>> expressly so that, if one application fails, the ephemeral nodes should be
> >>> deleted, and the other apps should be able to pick up the work that is now
> >>> not being completed by the failed instance.
> >>>
> >>>       Yesterday evening, one application instance suffered from some
> >>> severe memory pressure and had to run multiple stop the world GC cycles.
> >>> The pauses appear to have triggered a SessionExpiredException in
> >>> org.apache.zookeeper.ClientCnxn$SendThread.run (I correlated multiple
> >>> "Pause Full" statements in the GC logs with the ZK session timeout in the
> >>> application logs). After the timeout, the connection was re-established in
> >>> under 1,000ms, but the ephemeral nodes remained in ZooKeeper, leaving them
> >>> as orphans. We've seen this behavior before and have had to delete the
> >>> nodes manually using the zkCli.sh utility.
> >>>
> >>>       In an attempt to troubleshoot this issue, I'm trying to correlate
> >>> the ephemeral owner that is listed on a node when you run the 'get' command
> >>> with the ID of an active session. Basically, I'm trying to understand
> >>> whether ZK thinks there is still an active session associated with the
> >>> ephemeral node in the hopes that that might lead to an explanation for why
> >>> the ZK server didn't seem to recognize the session timeout sensed on the
> >>> client that triggered a new connection and would explain why the ephemeral
> >>> nodes were not deleted as they should have been when the connection dropped.
> >>>
> >>>       I've tried the various four letter commands on the server to see
> >>> if any of them output anything that looks like the ephemeral owner ID
> >>> without any success. Any suggestions/guidance would be greatly appreciated.
> >>> Note, right now, upgrading is not an option, but I'm certainly open to that
> >>> if there are known issues with ephemeral nodes in 3.4 that are addressed in
> >>> newer versions.
> >>>
> >>> Regards,
> >>> Paul
> >
>
>
