kafka-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Paolo Patierno <ppatie...@live.com>
Subject Re: [DISCUSS] KIP-179: Change ReassignPartitionsCommand to use AdminClient
Date Tue, 01 Aug 2017 15:34:56 GMT
Hi,


Regarding the double API I agree on having only one. Compared to an HTTP REST API you could
have an "already executed" (fast) operation so the HTTP response status code could be just
200 OK while a "long running" operation could have an HTTP response status code as 202 ACCEPTED
(but the processing has not completed).


Regarding adding the possibility to alter the topic config through the AlterTopic API, the
current TopicCommand implementation provides a warning on doing this suggesting to use the
ConfigCommand tool. So it would be a step back allowing to do the configs change with the
alter topic as well.


Last thing on the KIP ... the "timeout" field in the AlterTopicRequest is missing in the table
with related description.


Thanks


Paolo Patierno
Senior Software Engineer (IoT) @ Red Hat
Microsoft MVP on Windows Embedded & IoT
Microsoft Azure Advisor

Twitter : @ppatierno<http://twitter.com/ppatierno>
Linkedin : paolopatierno<http://it.linkedin.com/in/paolopatierno>
Blog : DevExperience<http://paolopatierno.wordpress.com/>


________________________________
From: Tom Bentley <t.j.bentley@gmail.com>
Sent: Tuesday, August 1, 2017 2:22 PM
To: dev@kafka.apache.org
Subject: Re: [DISCUSS] KIP-179: Change ReassignPartitionsCommand to use AdminClient

Hi Ismael,

Thanks for taking the time to look at this.

Currently the proposal around the ReplicaStatusRequest/Response just allows
you to see the lag for the given replicas. It's not something that's tied
to a prior request to alter the topic at all, though obviously you can use
it to monitor how a reassignment is progressing.

I think having the API independent of AlterTopicsRequest/Response is a good
thing, because people might want to know that independently of a
reassignment, for example. But the fact that you just see the instantaneous
lag makes it less than perfect for monitoring a reassignment. You would
have to make multiple calls to figure out how quickly the lag is
diminishing or growing. But I don't want each partition leader having to
maintain some kind of stateful calculation just in case it is queried for
the state of that partition.

So I think that problem might better be solved in the
kafka-reassign-partitions.sh tool. Let _it_ make multiple requests for
status and figure out how the lag is changing.

>From the PoV of the API presented by the protocol, the proposal has moved
in the direction of making AlterTopicsRequest/Response more and more like
CreateTopicsRequest/Response, and I think it has improved as a result of
this.

Personally I don't think we should split the AlterTopicsRequest/Response
API in two, one which we expect to be short running and another which we
expect to be long running. I think that would make the API less symmetric,
and thus less easily understood. What might make more sense is to use the
return code to indicate the distinction between, "OK, it's done completely"
and "OK, it's been started, but will finish asynchronously" -- In a way
that distinction already exists via the timeout in AlterTopicsOptions. In
other words a configs change might return NONE, but a reassignment (with a
short timeout) might return REQUEST_TIMED_OUT, though I guess the problem
with that is that TimeoutException is retriable, and REQUEST_TIMED_OUT
doesn't suggest that you can use another API for query for completion, so
perhaps a dedicated status code is warranted (ASYNCHRONOUS_COMPLETION?)

To be honest, if we added the ability to update the topic configs, it would
be even more symmetric, and I wouldn't be opposed to that, except that I
dislike having two code paths for achieving the same thing, and we already
have AlterConfigsRequest and Response.

What do other people think?

Cheers,

Tom


On 1 August 2017 at 14:25, Ismael Juma <ismael@juma.me.uk> wrote:

> Hi Tom,
>
> A high-level point for discussion before going into the details. The
> proposed protocol API `alterTopics` has 2 types of operations:
>
> 1. Operations that cause data movement (or deletion): increase/decrease of
> replication factor and partition reassignment. These are currently done by
> `kafka-reassign-partitions.sh` and having a way to check the progress is
> useful.
>
> 2. Operations that don't involve data movement: increase the number of
> partitions and update topic configs (this is not in the proposal, but could
> be). These are currently done by `kafka-topics.sh` and they don't usually
> take very long so progress reporting is less important.
>
> It would be worth thinking whether having 2 separate protocol APIs would be
> better. I can see pros and cons, so I'd be interested in what you and
> others think.
>
> Ismael
>
> On Tue, Aug 1, 2017 at 10:19 AM, Tom Bentley <t.j.bentley@gmail.com>
> wrote:
>
> > I have added the timeout I mentioned before, because it makes the
> > implementation of topic alteration more symmetric with the topic creation
> > APIs.
> >
> > I have also added a section ("Policy") on retrofitting the
> > CreateTopicPolicy's rules to topic alteration requests, and made a few
> > other minor fixes.
> >
> > I've still not factored out the topic name from the ReplicaStatusRequest
> > (Ismael, do you have an opinion about that?)
> >
> > If any one else has some feedback on this KIP that would be great.
> > Otherwise, I would like to start a vote on this KIP before the end of the
> > week.
> >
> > Thanks,
> >
> > Tom
> >
> > On 26 July 2017 at 11:45, Tom Bentley <t.j.bentley@gmail.com> wrote:
> >
> > > I've updated the KIP to fix those niggles, but I've not factored out
> the
> > > topic name from the ReplicaStatusRequest, yet.
> > >
> > > Looking at the topic creation APIs in more detail, the
> > CreateTopicsOptions
> > > has
> > >
> > > * `shouldValidateOnly()`, which would make a lot of sense for the alter
> > > topic APIs
> > > * `timeoutMs()`, which I'm not sure sure about...
> > >
> > > Topic creation doesn't require shifting replicas between brokers so
> it's
> > > reasonable to support timeout, because we don't expect it to take very
> > long.
> > >
> > > Topic alteration usually takes a while because we are going to have to
> > > move replicas. Since we're adding a whole API to track the progress of
> > that
> > > replication, I'm inclined to think that having a timeout is a bit
> > pointless.
> > >
> > > But should the replicaStatus() API have a timeout? I suppose it
> probably
> > > should.
> > >
> > >
> > > On 26 July 2017 at 10:58, Tom Bentley <t.j.bentley@gmail.com> wrote:
> > >
> > >> Thanks Paolo,
> > >>
> > >>   *   in the "Public Interfaces" section you wrote
> > >>> alterTopics(Set<AlterTopics>) but then a collection is used (instead
> > of a
> > >>> set) in the Proposed Changes section. I'm ok with collection.
> > >>>
> > >>
> > >> Agree it should be Collection.
> > >>
> > >>
> > >>>   *   in the summary of the alterTopics method you say "The request
> can
> > >>> change the number of partitions, replication factor and/or the
> > partition
> > >>> assignments." I think that the "and/or" is misleading (at least for
> > me).
> > >>> For the TopicCommand tool you can specify partitions AND replication
> > factor
> > >>> ... OR partition assignments (but not for example partitions AND
> > >>> replication factor AND partition assignments). Maybe it could be "The
> > >>> request can change the number of partitions and the related
> replication
> > >>> factor or specifying a partition assignments."
> > >>>
> > >>
> > >> Is there a reason why we can't support changing all three at once? It
> > >> certainly makes conceptual sense to, say, increase the number of
> > partitions
> > >> and replication factor, and specify how you want the partitions
> > assigned.
> > >> And doing two separate calls would be less efficient as we sync new
> > >> replicas on brokers only to then move them somewhere else.
> > >>
> > >> If there is a reason we don't want to support changing all three, then
> > we
> > >> can return the error INVALID_REQUEST (42). That would allow us to
> > support
> > >> changing all three at some time in the future, without having to
> change
> > the
> > >> API.
> > >>
> > >>
> > >>>   *   I know that it would be a breaking change in the Admin Client
> API
> > >>> but why having an AlteredTopic class which is quite similar to the
> > already
> > >>> existing NewTopic class ? I know that using NewTopic for the alter
> > method
> > >>> could be misleading. What about renaming NewTopic in something like
> > >>> AdminTopic ? At same time it means that the TopicDetails class (which
> > you
> > >>> can get from the current NewTopic) should be outside the
> > >>> CreateTopicsRequest because it could be used in the
> AlterTopicsRequest
> > as
> > >>> well.
> > >>>
> > >>
> > >> One problem with this is it tends to inhibit future API changes for
> > >> either newTopics() or alterTopics(), because any common class needs to
> > make
> > >> sense in both contexts.
> > >>
> > >> For createTopics() we get to specify some configs (the
> > >> Map<String,String>), but since the AdminClient already has
> > alterConfigs()
> > >> for changing topic configs after topic creation I don't think it's
> > right to
> > >> also support changing those configs via alterTopics() as well. But
> > having
> > >> them in a common AdminTopic class would suggest that that was
> supported.
> > >> Yes, alterTopics could return INVALID_REQUEST if it was given topic
> > >> configs, but this is just making the API harder to use since it is
> > >> weakening of the type safety of the API.
> > >>
> > >> I suppose we *could* write a common TopicDetails class and make the
> > >> existing nested one extend the common one, with deprecations, to
> > eventually
> > >> remove the nested one.
> > >>
> > >>
> > >>
> > >>>   *   A typo in the ReplicaStatus : gpartition() instead of
> partition()
> > >>>   *   In the AlterTopicRequets
> > >>>      *   the replication factor should be INT16
> > >>>
> > >>
> > >> Ah, thanks!
> > >>
> > >>
> > >>>      *   I would use same fields name as CreateTopicsRequest (they
> are
> > >>> quite similar)
> > >>>   *   What's the broker id in the ReplicaStatusRequest ?
> > >>>
> > >>
> > >> It's the broker, which is expected to have a replica of the given
> > >> partition, that we're querying the status of. It is necessary because
> > we're
> > >> asking the _leader_ for the partition about (a subset of) the status
> of
> > the
> > >> followers. Put another way, to identify the replica of a particular
> > >> partition on a particular broker we need the tuple (topic, partition,
> > >> broker).
> > >>
> > >> If we were always interested in the status of the partition across all
> > >> brokers we could omit the broker part. But for reassignment we
> actually
> > >> only care about a subset of the brokers.
> > >>
> > >>
> > >>>   *   Thinking aloud, could make sense having "Partition" in the
> > >>> ReplicaStatusRequest as an array so that I can specify in only one
> > request
> > >>> the status for partitions I'm interested in, in order to avoid e
> > request
> > >>> for each partition for the same topic. Maybe empty array could mean
> ..
> > >>> "give me status for all partitions of this topic". Of course it means
> > that
> > >>> the ReplicaStatusResponse should change because we should have an
> array
> > >>> with partition, broker, lag and so on
> > >>>
> > >>
> > >> You already can specify in one request the status for all the
> partitions
> > >> you're interested in (ReplicaStatus can be repeated/is an array
> field).
> > >>
> > >> We could factor out the topic to avoid repeating it, which would be
> more
> > >> efficient when we're querying the status of many partitions of a topic
> > >> and/or there are many brokers holding replicas. In other words, we
> could
> > >> factor it to look like this:
> > >>
> > >> ReplicaStatusRequest => [TopicReplicas]
> > >>   TopicReplicas => Topic [PartitionReplica]
> > >>     Topic => string
> > >>     PartitionReplica => Partition Broker
> > >>       Partition => int32
> > >>       Broker => int32
> > >>
> > >> Does this make sense to you?
> > >>
> > >>
> > >
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message