flink-user mailing list archives

From Philip Doctor <philip.doc...@physiq.com>
Subject Re: Flink 1.4.2 -> 1.5.0 upgrade Queryable State debugging
Date Fri, 20 Jul 2018 18:34:36 GMT
Yeah, I just went to reproduce this on a fresh environment: I blew away all the ZooKeeper data
and the error went away.  I'm running an HA JobManager setup (1 active, 2 standby) and 3 TMs.  I'm
not sure how to fully account for this behavior yet, but it looks like I can make this run from
a totally fresh environment.  So until I start practicing how to savepoint + restore a 1.4.2
-> 1.5.0 job, I look to be good for the moment.  Sorry to have bothered you all.


Thanks.

________________________________
From: Dawid Wysakowicz <dwysakowicz@apache.org>
Sent: Friday, July 20, 2018 3:09:46 AM
To: Philip Doctor
Cc: user; Kostas Kloudas
Subject: Re: Flink 1.4.2 -> 1.5.0 upgrade Queryable State debugging

Hi Philip,
Could you attach the full stack trace? Are you querying the same job/cluster in both tests?
I am also looping in Kostas, who might know more about changes in Queryable state between
1.4.2 and 1.5.0.
Best,
Dawid

On Thu, 19 Jul 2018 at 22:33, Philip Doctor <philip.doctor@physiq.com> wrote:
Dear Flink Users,
I'm trying to upgrade to Flink 1.5.0, and so far everything works except for the Queryable State
client.  Now here's where it gets weird.  I have the client sitting behind a web API so the
rest of our non-Java ecosystem can consume it.  I've got 2 tests: one calls my route directly
as a Java method call, the other calls the deployed server via HTTP (the difference between the
tests is intended to exercise whether the server is properly started, etc).

The local call succeeds and the remote call fails, but the failure isn't what I'd expect (something
around the server layer); it's complaining about contacting the oracle for state location:

<very very long stack trace>
Caused by: org.apache.flink.queryablestate.exceptions.UnknownLocationException: Could not contact the state location oracle to retrieve the state location.
	at org.apache.flink.queryablestate.client.proxy.KvStateClientProxyHandler.getKvStateLookupInfo(KvStateClientProxyHandler.java:228)
<etc>

The client call is pretty straightforward:

        return client.getKvState(
            flinkJobID,
            stateDescriptor.queryableStateName,
            key,
            keyTypeInformation,
            stateDescriptor
        )

I've confirmed via logs that I have the exact same key, flink job ID, and queryable state
name.

So I'm going bonkers over what difference might exist.  I'm wondering if I'm packaging my jar wrong
and there are some resources I need to look out for?  (Back on Flink 1.3.x I had to handle the
reference.conf file for Akka when depending on that for Queryable State; is there
something like that?)  Is there /more logging/ somewhere on the server side that might
give me a hint?  Something like "Tried to query state for X but couldn't find the BananaLever"?  I'm
pretty stuck right now and ready to try any random ideas to move forward.


The only change I've made aside from the jar version bump from 1.4.2 to 1.5.0 is handling
queryableStateName (which is now nullable).  No other code changes were required, and to confirm,
this all works just fine with 1.4.2.


Thank you.
