From: Pierre Zemb
Date: Mon, 27 Aug 2018 14:16:54 +0200
Subject: Re: Question about QueryableState
To: Kostas Kloudas
Cc: user@flink.apache.org
In-Reply-To: <46930CC2-B711-4CAD-82A1-0F9582F38DF7@data-artisans.com>

Hi!

Just created the JIRA (https://issues.apache.org/jira/browse/FLINK-10225).

Thanks for your reply,
Pierre

On Thu, Aug 23, 2018 at 14:31, Kostas Kloudas wrote:

> Hi Pierre,
>
> You are right that this should not happen.
> It seems like a bug.
> Could you open a JIRA and post it here?
>
> Thanks,
> Kostas
>
> On Aug 21, 2018, at 9:35 PM, Pierre Zemb wrote:
>
> Hi!
>
> I’ve started to deploy a small Flink cluster (4 TMs and 1 JM for now, on
> 1.6.0) and deployed a small job on it. Because of the current load, the
> job is handled entirely by a single tm. I’ve created a small proxy that
> uses QueryableStateClient to access the current state. It works nicely,
> except under certain circumstances. It seems to me that I can only access
> the state through a node that is holding a part of the job. Here’s an
> example:
>
>    - job on tm1. Pointing QueryableStateClient to tm1. State accessible
>    - job still on tm1. Pointing QueryableStateClient to tm2 (for
>    example). State inaccessible
>    - killing tm1, job is now on tm2. State accessible
>    - job still on tm2. Pointing QueryableStateClient to tm3. State
>    inaccessible
>    - adding some parallelism to spread the job across tm1 and tm2.
>    Pointing QueryableStateClient to either tm1 or tm2 works
>    - job still on tm1 and tm2. Pointing QueryableStateClient to tm3.
>    State inaccessible
>
> When the state is inaccessible, I can see this (generated here):
>
> java.lang.RuntimeException: Failed request 0.
>  Caused by: org.apache.flink.queryablestate.exceptions.UnknownLocationException: Could not retrieve location of state=repo-status of job=3ac3bc00b2d5bc0752917186a288d40a. Potential reasons are: i) the state is not ready, or ii) the job does not exist.
>     at org.apache.flink.queryablestate.client.proxy.KvStateClientProxyHandler.getKvStateLookupInfo(KvStateClientProxyHandler.java:228)
>     at org.apache.flink.queryablestate.client.proxy.KvStateClientProxyHandler.getState(KvStateClientProxyHandler.java:162)
>     at org.apache.flink.queryablestate.client.proxy.KvStateClientProxyHandler.executeActionAsync(KvStateClientProxyHandler.java:129)
>     at org.apache.flink.queryablestate.client.proxy.KvStateClientProxyHandler.handleRequest(KvStateClientProxyHandler.java:119)
>     at org.apache.flink.queryablestate.client.proxy.KvStateClientProxyHandler.handleRequest(KvStateClientProxyHandler.java:63)
>     at org.apache.flink.queryablestate.network.AbstractServerHandler$AsyncRequestTask.run(AbstractServerHandler.java:236)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)
>
> From the documentation, I can see that:
>
> The client connects to a Client Proxy running on a given Task Manager. The
> proxy is the entry point of the client to the Flink cluster. It forwards
> the requests of the client to the Job Manager and the required Task
> Manager, and forwards the final response back to the client.
>
> Did I miss something? Is the QueryableStateClientProxy only fetching info
> from a job that is running on its local tm? If so, is there a way to
> retrieve the job-graph? Or maybe another solution?
>
> Thanks!
> Pierre Zemb
>
> --
> Regards,
> Pierre Zemb
> pierrezemb.fr
> Software Engineer, Metrics Data Platform @OVH

--
Regards,
Pierre Zemb
pierrezemb.fr
Software Engineer, Metrics Data Platform @OVH
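
A minimal sketch of a possible client-side workaround while FLINK-10225 is open, assuming the behaviour reported above (only a proxy on a tm that runs part of the job can locate the state). It simply tries each tm proxy in turn until one answers. ProxyFallback, StateQuery, queryAnyProxy and the host names are hypothetical and not part of Flink; in a real proxy the lambda would wrap the QueryableStateClient lookup against that host. The main method only simulates the "job on tm1" case from the list above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

final class ProxyFallback {

    /** Hypothetical stand-in for one state lookup against a single tm proxy. */
    @FunctionalInterface
    interface StateQuery<V> {
        V query(String proxyHost) throws Exception;
    }

    /**
     * Tries each proxy host in order and returns the first non-null result.
     * Returns Optional.empty() if every host fails or returns null.
     */
    static <V> Optional<V> queryAnyProxy(List<String> proxyHosts, StateQuery<V> query) {
        for (String host : proxyHosts) {
            try {
                V value = query.query(host);
                if (value != null) {
                    return Optional.of(value);
                }
            } catch (Exception e) {
                // In the reported case this would be the failed request caused by
                // UnknownLocationException; log it and move on to the next host.
                System.err.println("Proxy " + host + " failed: " + e.getMessage());
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Simulation only: the job runs on tm1, so tm2 and tm3 cannot locate the state.
        StateQuery<String> simulated = host -> {
            if (!"tm1".equals(host)) {
                throw new RuntimeException("Could not retrieve location of state=repo-status");
            }
            return "state-from-" + host;
        };
        Optional<String> result = queryAnyProxy(Arrays.asList("tm3", "tm2", "tm1"), simulated);
        System.out.println(result.orElse("state inaccessible")); // prints state-from-tm1
    }
}
```

This costs one failed round trip per tm that does not hold the job, so it is only a stopgap until the proxy resolves the location itself.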