kafka-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "东方甲乙" <254479...@qq.com>
Subject RE: [DISCUSS] KIP-148: Add a connect timeout for client
Date Mon, 29 May 2017 05:34:32 GMT
Hi Guozhang
    I do agree for adding  a universal `connect.timeout.ms`, and in
another KIP consider adding per-request-type timeout overrides

    And is there any other comments about this KIP?  Looking forward to the comments.


Thanks
David




------------------ 原始邮件 ------------------
发件人: "Guozhang Wang";<wangguoz@gmail.com>;
发送时间: 2017年5月24日(星期三) 上午10:07
收件人: "dev@kafka.apache.org"<dev@kafka.apache.org>; 

主题: Re: [DISCUSS] KIP-148: Add a connect timeout for client



I think using a single config to cover end-to-end latency with connecting
and request round-trip may not be best appropriate since 1) some request
may need much more time than others since they are parked (fetch request
with long polling, join group request etc) or throttled, and 2) some
requests are prerequisite of others, like group request to discover the
coordinator before the fetch offset request, and implementation wise these
request send/receive is embedded in latter ones, hence it is not clear if
the `request.timeout.ms` should cover just a single RPC or more.

So no matter whether we add a `connect.timeout.ms` in addition to `
request.timeout.ms`, we should consider adding per-request-type timeout
value, and make `request.timeout.ms` a global default; if we add the `
connect.timeout.ms` the per-request value is only for the round trip,
otherwise it is supposed to include the connecting time. Personally I'd
prefer the first option to add a universal `connect.timeout.ms`, and in
another KIP consider adding per-request-type timeout overrides.

BTW if the consumer issue is the only cause that we are having a high
default value, I'd suggest we separate the consumer rebalance timeout and
not piggy-back on the session timeout. Then we can set the default `
request.timeout.ms` to a smaller value, like 10 secs. This is orthogonal to
this KIP discussion and we can continue this in a separate thread.


Guozhang

On Tue, May 23, 2017 at 3:31 PM, Colin McCabe <cmccabe@apache.org> wrote:

> Another note-- it would be really nice if timeouts were end-to-end,
> rather than being set for particular phases of an RP  From a user point
> of view, a 30 second timeout should mean that the call either succeeds
> or fails after 30 seconds, regardless of how much time is spent looking
> for metadata, connecting to brokers, waiting for brokers, etc.  This is
> implemented in AdminClient by setting a deadline when the call is first
> created and referring to that afterwards.
>
> best,
> Colin
>
>
> On Tue, May 23, 2017, at 13:18, Colin McCabe wrote:
> > In the AdminClient, we allow setting per-call timeouts.  The global
> > timeout is just a default.  It seems like that is really what we should
> > do in the producer and consumer as well, rather than having a lot of
> > special cases for timeouts in  connecting vs. other call states.  Then
> > join requests could gave a 5 minute timeout, but other requests could
> > gave a shorter one.  Thoughts?
> >
> > Cheers,
> > Colin
> >
> > OnTue, May 23, 2017, at 04:27, Rajini Sivaram wrote:
> > > Guozhang,
> > >
> > > At the moment we don't have a connect timeout. And the behaviour
> > > suggested
> > > in the KIP is useful to address this.
> > >
> > > We do however have a request.timeout.ms. This is the amount of time it
> > > would take to detect a crashed broker if the broker crashed after a
> > > connection was established. Unfortunately in the consumer, this was
> > > increased to > 5minutes since JoinRequest can take up to
> > > max.poll.interval.ms, which has a default of  5 minutes. Since the
> > > whole point of this timeout is to detect a crashed broker, 5 minutes is
> > > too
> > > large.
> > >
> > > My suggestion was to use request.timeout.ms to also detect connection
> > > timeouts to a crashed broker - implement the behavior suggested in the
> > > KIP
> > > without adding a new config parameter. As Ismael has said, this will
> need
> > > to fix request.timeout.ms in the consumer.
> > >
> > >
> > > On Mon, May 22, 2017 at 1:23 PM, Simon Souter <
> simons@cakesolutions.net>
> > > wrote:
> > >
> > > > The following tickets are probably relevant to this KIP:
> > > >
> > > > https://issues.apache.org/jira/browse/KAFKA-3457
> > > > https://issues.apache.org/jira/browse/KAFKA-1894
> > > > https://issues.apache.org/jira/browse/KAFKA-3834
> > > >
> > > > On 22 May 2017 at 16:30, Rajini Sivaram <rajinisivaram@gmail.com>
> wrote:
> > > >
> > > > > Ismael,
> > > > >
> > > > > Yes, agree. My concern was that a connection can be shutdown
> uncleanly at
> > > > > any time. If a client is in the middle of a request, then it times
> out
> > > > > after min(request.timeout.ms, tcp-timeout). If we add another
> config
> > > > > option
> > > > > connect.timeout.ms, then we will sometimes wait for min(
> > > > connect.timeout.ms
> > > > > ,
> > > > > tcp-timeout) and sometimes for min(request.timeout.ms,
> tcp-timeout),
> > > > > depending
> > > > > on connection state. One config option feels neater to me.
> > > > >
> > > > > On Mon, May 22, 2017 at 11:21 AM, Ismael Juma <ismael@juma.me.uk>
> wrote:
> > > > >
> > > > > > Rajini,
> > > > > >
> > > > > > For this to have the desired effect, we'd probably need to lower
> the
> > > > > > default request.timeout.ms for the consumer and fix the
> underlying
> > > > > reason
> > > > > > why it is a little over 5 minutes at the moment.
> > > > > >
> > > > > > Ismael
> > > > > >
> > > > > > On Mon, May 22, 2017 at 4:15 PM, Rajini Sivaram <
> > > > rajinisivaram@gmail.com
> > > > > >
> > > > > > wrote:
> > > > > >
> > > > > > > Hi David,
> > > > > > >
> > > > > > > Sorry, what I meant was: Can you reuse the existing
> configuration
> > > > > option
> > > > > > > request.timeout,ms , instead of adding a new config and
add the
> > > > > behaviour
> > > > > > > that you have proposed in the KIP for the connection phase
> using this
> > > > > > > timeout? I think the timeout for connection is useful.
I am
> not sure
> > > > we
> > > > > > > need another configuration option to implement it.
> > > > > > >
> > > > > > > Regards,
> > > > > > >
> > > > > > > Rajini
> > > > > > >
> > > > > > >
> > > > > > > On Mon, May 22, 2017 at 11:06 AM, 东方甲乙 <254479818@qq.com>
> wrote:
> > > > > > >
> > > > > > > > Hi Rajini.
> > > > > > > >
> > > > > > > > When kafka node' machine is shutdown or network is
closed,
> the
> > > > > > connecting
> > > > > > > > phase could not use the request.timeout.ms, because
the
> client
> > > > > haven't
> > > > > > > > send a req yet.   And no response for the nio, the
selector
> will
> > > > not
> > > > > > > close
> > > > > > > > the connect, so it will not choose other good node
to get the
> > > > > metadata.
> > > > > > > >
> > > > > > > >
> > > > > > > > Thanks
> > > > > > > > David
> > > > > > > >
> > > > > > > > ------------------ 原始邮件 ------------------
> > > > > > > > *发件人:* "Rajini Sivaram" <rajinisivaram@gmail.com>;
> > > > > > > > *发送时间:* 2017年5月22日(星期一) 20:17
> > > > > > > > *收件人:* "dev" <dev@kafka.apache.org>;
> > > > > > > > *主题:* Re: [DISCUSS] KIP-148: Add a connect timeout
for client
> > > > > > > >
> > > > > > > >
> > > > > > > > Hi David,
> > > > > > > >
> > > > > > > > Is there a reason why you wouldn't want to use
> request.timeout.ms
> > > > as
> > > > > > the
> > > > > > > > timeout parameter for connections? Then you would
use the
> same
> > > > > timeout
> > > > > > > for
> > > > > > > > connected and connecting phases when shutdown is unclean.
> You could
> > > > > > still
> > > > > > > > use the timeout to ensure that next metadata request
is sent
> to
> > > > > another
> > > > > > > > node.
> > > > > > > >
> > > > > > > > Regards,
> > > > > > > >
> > > > > > > > Rajini
> > > > > > > >
> > > > > > > > On Sun, May 21, 2017 at 9:51 AM, 东方甲乙 <254479818@qq.com>
> wrote:
> > > > > > > >
> > > > > > > > > Hi Guozhang,
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Thanks for the clarify. For the clarify 2, I
think the key
> thing
> > > > is
> > > > > > not
> > > > > > > > > users control how much time in maximum to wait
for inside
> code,
> > > > but
> > > > > > is
> > > > > > > > the
> > > > > > > > > network client can be aware of the connecting
can't be
> finished
> > > > and
> > > > > > > try a
> > > > > > > > > good node. In the producer.sender even the selector.poll
> can
> > > > > timeout,
> > > > > > > but
> > > > > > > > > the next time is also not close the previous
connecting
> and try
> > > > > > another
> > > > > > > > > good node.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > In out test env, QA shutdown one of the leader
node, the
> producer
> > > > > > send
> > > > > > > > the
> > > > > > > > > request will timeout and close the node's connection
then
> request
> > > > > the
> > > > > > > > > metadata.  But sometimes the request node is
also the
> shutdown
> > > > > node.
> > > > > > > > When
> > > > > > > > > connecting the shutting down node to get the
metadata, it
> is in
> > > > the
> > > > > > > > > connecting phase, network client mark the connecting
> node's state
> > > > > to
> > > > > > > > > CONNECTING, but if the node is shutdown,  the
socket can't
> be
> > > > aware
> > > > > > of
> > > > > > > > the
> > > > > > > > > connecting is broken. Though the selector.poll
has timeout
> > > > > parameter,
> > > > > > > but
> > > > > > > > > it will not close the connection, so the next
> > > > > > > > > time in the "networkclient.maybeUpdate" it will
check if
> > > > > > > > > isAnyNodeConnecting, then will not connect to
any good
> node the
> > > > get
> > > > > > the
> > > > > > > > > metadata.  It need about several minutes to
> > > > > > > > > aware the connecting is timeout and try other
node.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > So I want to add a connect.timeout parameter,
 the
> selector can
> > > > > find
> > > > > > > the
> > > > > > > > > connecting is timeout and close the connection.
 It seems
> the
> > > > > > currently
> > > > > > > > the
> > > > > > > > > timeout value passed in `selector.poll()`
> > > > > > > > > seems can not do this.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > David
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > ------------------ 原始邮件 ------------------
> > > > > > > > > 发件人: "Guozhang Wang";<wangguoz@gmail.com>;
> > > > > > > > > 发送时间: 2017年5月16日(星期二) 凌晨1:51
> > > > > > > > > 收件人: "dev@kafka.apache.org"<dev@kafka.apache.org>;
> > > > > > > > >
> > > > > > > > > 主题: Re: [DISCUSS] KIP-148: Add a connect
timeout for client
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Hi David,
> > > > > > > > >
> > > > > > > > > I may be a bit confused before, just clarifying
a few
> things:
> > > > > > > > >
> > > > > > > > > 1. As you mentioned, a client will always try
to first
> establish
> > > > > the
> > > > > > > > > connection with a broker node before it tries
to send any
> request
> > > > > to
> > > > > > > it.
> > > > > > > > > And after connection is established, it will
either
> continuously
> > > > > send
> > > > > > > > many
> > > > > > > > > requests (e.g. produce) for just a single request
(e.g.
> metadata)
> > > > > to
> > > > > > > the
> > > > > > > > > broker, so these two phases are indeed different.
> > > > > > > > >
> > > > > > > > > 2. In the connected phase, connections.max.idle.ms
is
> used to
> > > > > > > > > auto-disconnect the socket if no requests has
been sent /
> > > > received
> > > > > > > during
> > > > > > > > > that period of time; in the connecting phase,
we always
> try to
> > > > > create
> > > > > > > the
> > > > > > > > > socket via "socketChannel.connect" in a non-blocking
call,
> and
> > > > then
> > > > > > > > checks
> > > > > > > > > if the connection has been established, but all
the
> callers of
> > > > this
> > > > > > > > > function (in either producer or consumer) has
a timeout
> parameter
> > > > > as
> > > > > > in
> > > > > > > > > `selector.poll()`, and the timeout parameter
is set either
> by
> > > > > > > > calculations
> > > > > > > > > based on metadata.expiration.time and backoff
for
> > > > producer#sender,
> > > > > or
> > > > > > > by
> > > > > > > > > directly passed values from consumer#poll(timeout),
so
> although
> > > > > there
> > > > > > > is
> > > > > > > > no
> > > > > > > > > directly config controlling that, users can still
control
> how
> > > > much
> > > > > > time
> > > > > > > > in
> > > > > > > > > maximum to wait for inside code.
> > > > > > > > >
> > > > > > > > > I originally thought your scenarios is more on
the
> connected
> > > > phase,
> > > > > > but
> > > > > > > > now
> > > > > > > > > I feel you are talking about the connecting phase.
For that
> > > > case, I
> > > > > > > still
> > > > > > > > > feel currently the timeout value passed in
> `selector.poll()`
> > > > which
> > > > > is
> > > > > > > > > controllable from user code should be sufficient?
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Guozhang
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Sun, May 14, 2017 at 2:37 AM, 东方甲乙
<254479818@qq.com>
> wrote:
> > > > > > > > >
> > > > > > > > > > Hi Guozhang,
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Sorry for the delay, thanks for the question.
 It seems
> two
> > > > > > different
> > > > > > > > > > parameters to me:
> > > > > > > > > > connect.timeout.ms: only work for the connecting
> phrase, after
> > > > > > > > connected
> > > > > > > > > > phrase this parameter is not used.
> > > > > > > > > > connections.max.idle.ms: currently not work
in the
> connecting
> > > > > > phrase
> > > > > > > > > > (only select return readyKeys >0) will
add to the expired
> > > > > manager,
> > > > > > > > after
> > > > > > > > > > connected will check if the connection is
still alive in
> some
> > > > > time.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Even if we change the connections.max.idle.ms
to work
> > > > including
> > > > > > the
> > > > > > > > > > connecting phrase, we can not set this parameter
to a
> small
> > > > > value,
> > > > > > > such
> > > > > > > > > as
> > > > > > > > > > 5 seconds. Because the client is maybe busy
sending
> message to
> > > > > > other
> > > > > > > > > node,
> > > > > > > > > > it will be disconnected in 5 seconds, so
the default
> value of
> > > > > > > > > > connections.max.idle.ms is setting to a
larger time. We
> should
> > > > > > have
> > > > > > > > two
> > > > > > > > > > parameters to control the connecting phrase
behavior and
> the
> > > > > > > connected
> > > > > > > > > > phrase behavior, do you think so?
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > David
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > ------------------ 原始邮件 ------------------
> > > > > > > > > > 发件人: "Guozhang Wang";<wangguoz@gmail.com>;
> > > > > > > > > > 发送时间: 2017年5月6日(星期六)
上午7:52
> > > > > > > > > > 收件人: "dev@kafka.apache.org"<dev@kafka.apache.org>;
> > > > > > > > > >
> > > > > > > > > > 主题: Re: [DISCUSS] KIP-148: Add a connect
timeout for
> client
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Hello David,
> > > > > > > > > >
> > > > > > > > > > Thanks for the KIP. For the described issue,
I'm
> wondering if
> > > > it
> > > > > > can
> > > > > > > be
> > > > > > > > > > resolved by tuning the CONNECTIONS_MAX_IDLE_MS_CONFIG
(
> > > > > > > > > > connections.max.idle.ms) on the client side?
Default is
> 9
> > > > > minutes.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Guozhang
> > > > > > > > > >
> > > > > > > > > > On Tue, May 2, 2017 at 8:22 AM, 东方甲乙
<254479818@qq.com>
> wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi all,
> > > > > > > > > > >
> > > > > > > > > > > Currently in our test environment,
we found that after
> one of
> > > > > the
> > > > > > > > > broker
> > > > > > > > > > > node crash (reboot or os crash), the
client may still
> be
> > > > > > connecting
> > > > > > > > to
> > > > > > > > > > the
> > > > > > > > > > > crash node to send metadata request
or other request,
> and it
> > > > > > needs
> > > > > > > > > > several
> > > > > > > > > > > minutes to be aware that the connection
is timeout
> then try
> > > > > > another
> > > > > > > > > node
> > > > > > > > > > to
> > > > > > > > > > > connect to send the request. Then the
client may still
> not be
> > > > > > aware
> > > > > > > > of
> > > > > > > > > > the
> > > > > > > > > > > metadata change after several minutes.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > So I want to add a connect timeout
on the  client,
> please
> > > > > take a
> > > > > > > > look
> > > > > > > > > > at:
> > > > > > > > > > >
> > > > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> > > > > > > > > > > 148%3A+Add+a+connect+timeout+for+client
> > > > > > > > > > >
> > > > > > > > > > > Regards,
> > > > > > > > > > >
> > > > > > > > > > > David
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > -- Guozhang
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > -- Guozhang
> > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > [image: cake_logo_strap_screen 400.jpg] <
> http://www.cakesolutions.net>
> > > >
> > > > Simon Souter
> > > > (Office) 0845 617 1200
> > > > Houldsworth Mill, Houldsworth Street, Reddish, Stockport, SK5 6DA, UK
> > > > [image: twitter-circle-darkgrey.png]
> > > > <https://twitter.com/cakesolutions> [image:
> > > > facebook-circle-darkgrey.png]
> > > > <https://www.facebook.com/cakesolutionslimited/> [image:
> > > > linkedin-circle-darkgrey.png]
> > > > <https://www.linkedin.com/company/cake-solutions-limited>
> > > > [image: Reactive Applications]
> > > > <https://cakesolutions.sigstr.net/uc/588780e6825be936ed5682e0>
> > > > Company registered in the UK, No. 4184567 If you have received this
> e-mail
> > > > in error, please accept our apologies, destroy it immediately, and
> it would
> > > > be greatly appreciated if you notified the sender. It is your
> > > > responsibility to protect your system from viruses and any other
> harmful
> > > > code or device. We try to eliminate them from e-mails and
> attachments, but
> > > > we accept no liability for any which remain. We may monitor or
> access any
> > > > or all e-mails sent to us.
> > > > [image: Powered by Sigstr]
> > > > <https://cakesolutions.sigstr.net/uc/588780e6825be936ed5682e0/
> watermark>
> > > >
>



-- 
-- Guozhang
Mime
  • Unnamed multipart/alternative (inline, 8-Bit, 0 bytes)
View raw message