kafka-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ismael Juma <ism...@juma.me.uk>
Subject Re: [DISCUSS] KIP-79 - ListOffsetRequest v1 and offsetForTime() method in new consumer.
Date Fri, 09 Sep 2016 15:06:29 GMT
Our inconsistent usage of magic values like `-1` and `null` to represent
missing values makes it hard to reason about code. We had a recent major
regression in MirrorMaker as a result of this[1], for example. `null` by
itself is hard enough (often called the billion dollar mistake), so can we
decide what we prefer and stick with that going forward?

My take is: `Option`/`Optional` > `null` > `-1` unless it's a particularly
performance sensitive area where the boxing incurred by having `null` as a
possible value is a problem. Since we're still on Java 7, we can't use
`Optional` in Java code.

Ismael

[1] The fix:
https://github.com/apache/kafka/commit/4e4e2fb5085758ee9ccf6307433ad531a33198d3

On Fri, Sep 9, 2016 at 3:48 PM, Jun Rao <jun@confluent.io> wrote:

> Jiangjie,
>
> Returning TimestampOffset(-1, -1) sounds reasonable to me.
>
> Thanks,
>
> Jun
>
> On Thu, Sep 8, 2016 at 8:46 PM, Becket Qin <becket.qin@gmail.com> wrote:
>
> > Hi Jun,
> >
> > > 1. latestOffsets() returns the next offset. So there won't be a
> timestamp
> > > associated with it. Would we use something like -1 for timestamp?
> >
> > The returned value would high watermark, so there might be an associated
> > timestamp. But if it is log end offset, it seems that -1 is a reasonable
> > value.
> >
> > > 2. Jason mentioned that if no message has timestamp >= the provided
> > > timestamp, we return a null value for that partition. Could we document
> > > that in the wiki?
> >
> > I made a minor change to return a TimestampOffset(-1, -1) in that case.
> Not
> > sure which on is better but there seems only minor difference. What do
> you
> > think?
> >
> > I haven't seen a planned release date yet, but I can probably get it done
> > in 2-3 weeks with reasonable rounds of reviews.
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> >
> > On Thu, Sep 8, 2016 at 6:24 PM, Jun Rao <jun@confluent.io> wrote:
> >
> > > Hi, Jiangjie,
> > >
> > > Thanks for the updated KIP. A couple of minor comments.
> > >
> > > 1. latestOffsets() returns the next offset. So there won't be a
> timestamp
> > > associated with it. Would we use something like -1 for timestamp?
> > >
> > > 2. Jason mentioned that if no message has timestamp >= the provided
> > > timestamp, we return a null value for that partition. Could we document
> > > that in the wiki?
> > >
> > > BTW, we are getting close to the next release. This is a really nice
> > > feature to have. Do you think you will have a patch ready for the next
> > > release?
> > >
> > > Thanks.
> > >
> > > Jun
> > >
> > > On Wed, Sep 7, 2016 at 2:47 PM, Becket Qin <becket.qin@gmail.com>
> wrote:
> > >
> > > > That sounds reasonable to me. I'll update the KIP wiki page.
> > > >
> > > > On Wed, Sep 7, 2016 at 1:34 PM, Jason Gustafson <jason@confluent.io>
> > > > wrote:
> > > >
> > > > > Hey Becket,
> > > > >
> > > > > I don't have a super strong preference, but I think this
> > > > >
> > > > > earliestOffset(singleton(partition));
> > > > >
> > > > > captures the intent more clearly than this:
> > > > >
> > > > > offsetsForTimes(singletonMap(partition, -1));
> > > > >
> > > > > I can understand the desire to keep the API footprint small, but
I
> > > think
> > > > > the use case is common enough to justify separate APIs. A couple
> > > > additional
> > > > > points:
> > > > >
> > > > > 1. If we had separate methods, it might make sense to treat
> negative
> > > > > timestamps as illegal in offsetsForTimes. That seems safer from the
> > > user
> > > > > perspective since legitimate timestamps should always be positive.
> > > > > 2. The expected behavior of offsetsForTimes is to return the
> earliest
> > > > > offset which is greater than or equal to the passed offset, so
> having
> > > > > Long.MAX_VALUE return the latest value doesn't seem very intuitive
> to
> > > > me. I
> > > > > would actually expect it to return null.
> > > > >
> > > > > Given that, I think I prefer having the custom methods. What do you
> > > > think?
> > > > >
> > > > > Thanks,
> > > > > Jason
> > > > >
> > > > > On Wed, Sep 7, 2016 at 1:00 PM, Becket Qin <becket.qin@gmail.com>
> > > wrote:
> > > > >
> > > > > > Hi Jason,
> > > > > >
> > > > > > Thanks for the feedback. That is a good point. For the -1 and
-2
> > > > > semantics,
> > > > > > I was just thinking we will preserve the semantics in the wire
> > > > protocol.
> > > > > > For the user facing API, I agree that is not intuitive. We can
do
> > one
> > > > of
> > > > > > the following:
> > > > > > 1. Add two separate methods: earliestOffsets() and
> latestOffsets().
> > > > > > 2. just have offsetsForTimes() and return the earliest if the
> > > timestamp
> > > > > is
> > > > > > negative and the latest if the timestamp is Long.MAX_VALUE.
> > > > > >
> > > > > > The good thing about doing (1) is that we kind of have symmetric
> > > > function
> > > > > > signatures like seekToBeginning() and seekToEnd(). However,
even
> if
> > > we
> > > > do
> > > > > > (1), we may still need to do (2) to handle the negative timestamp
> > and
> > > > the
> > > > > > Long.MAX_VALUE timestamp in offsetsForTimes(). Then they
> > essentially
> > > > > become
> > > > > > redundant to earliestOffsets() and latestOffsets().
> > > > > >
> > > > > > Personally I prefer option (2) because of the conciseness and
it
> > > seems
> > > > > > intuitive enough. But I am open to option (1) as well.
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Jiangjie (Becket) Qin
> > > > > >
> > > > > > On Wed, Sep 7, 2016 at 11:06 AM, Jason Gustafson <
> > jason@confluent.io
> > > >
> > > > > > wrote:
> > > > > >
> > > > > > > Hey Becket,
> > > > > > >
> > > > > > > Thanks for the KIP. As I understand, the intention is to
> preserve
> > > the
> > > > > > > current behavior with a timestamp of -1 indicating latest
> > timestamp
> > > > and
> > > > > > -2
> > > > > > > indicating earliest timestamp. So users can query these
offsets
> > > using
> > > > > the
> > > > > > > offsetsForTimes API if they know the magic values. I'm
> wondering
> > if
> > > > it
> > > > > > > would make the usage a little nicer to have a separate
API
> > instead
> > > > for
> > > > > > > these special cases? Sort of in the way that we expose
a
> generic
> > > > seek()
> > > > > > and
> > > > > > > a seekToBeginning(), maybe we could have an earliestOffset()
in
> > > > > addition
> > > > > > to
> > > > > > > offsetsForTimes()?
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Jason
> > > > > > >
> > > > > > > On Wed, Sep 7, 2016 at 10:04 AM, Becket Qin <
> > becket.qin@gmail.com>
> > > > > > wrote:
> > > > > > >
> > > > > > > > Thanks everyone for all the feedback.
> > > > > > > >
> > > > > > > > If there is no further concerns or comments I will
start a
> > voting
> > > > > > thread
> > > > > > > on
> > > > > > > > this KIP tomorrow.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > >
> > > > > > > > Jiangjie (Becket) Qin
> > > > > > > >
> > > > > > > > On Tue, Sep 6, 2016 at 9:48 AM, Becket Qin <
> > becket.qin@gmail.com
> > > >
> > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi Magnus,
> > > > > > > > >
> > > > > > > > > Thanks for the comments. I agree that querying
messages
> > within
> > > a
> > > > > time
> > > > > > > > > range is a valid use case (actually this is an
example use
> > case
> > > > in
> > > > > my
> > > > > > > > > previous email). The current proposal can achieve
this by
> > > having
> > > > > two
> > > > > > > > > ListOffsetRequest, right? I think the current
API already
> > > > supports
> > > > > > the
> > > > > > > > use
> > > > > > > > > cases that require the offsets for multiple timestamps.
The
> > > > > question
> > > > > > is
> > > > > > > > > that whether it is worth adding more complexity
to the
> > protocol
> > > > to
> > > > > > make
> > > > > > > > it
> > > > > > > > > easier for multiple timestamp query. Personally
I think
> given
> > > > that
> > > > > > > query
> > > > > > > > > multiple timestamps is likely an infrequent operation,
> there
> > is
> > > > no
> > > > > > need
> > > > > > > > to
> > > > > > > > > optimize for it and complicates the protocol.
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > >
> > > > > > > > > Jiangjie (Becket) Qin
> > > > > > > > >
> > > > > > > > > On Mon, Sep 5, 2016 at 11:21 PM, Magnus Edenhill
<
> > > > > magnus@edenhill.se
> > > > > > >
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > >> Good write-up Qin, the API looks promising.
> > > > > > > > >>
> > > > > > > > >> I have one comment:
> > > > > > > > >>
> > > > > > > > >> 2016-09-03 5:20 GMT+02:00 Becket Qin <
> becket.qin@gmail.com
> > >:
> > > > > > > > >>
> > > > > > > > >> > The currently offsetsForTimes() API
obviously does not
> > > support
> > > > > > > > querying
> > > > > > > > >> > multiple timestamps for the same partition.
It doesn't
> > > seems a
> > > > > > > feature
> > > > > > > > >> for
> > > > > > > > >> > ListOffsetRequest v0 either (sounds
more like a bug). My
> > > > > intuition
> > > > > > > is
> > > > > > > > >> that
> > > > > > > > >> > it's a rare use case. Given it does
not exist before and
> > we
> > > > > don't
> > > > > > > see
> > > > > > > > a
> > > > > > > > >> > strong need from the community either,
maybe it is
> better
> > to
> > > > > keep
> > > > > > it
> > > > > > > > >> simple
> > > > > > > > >> > for ListOffsetRequest v1. We can add
it later if it
> turns
> > > out
> > > > to
> > > > > > be
> > > > > > > a
> > > > > > > > >> > useful feature (that may need a interface
change, but I
> > > > honestly
> > > > > > do
> > > > > > > > not
> > > > > > > > >> > think people would frequently query
many different
> > > timestamps
> > > > > for
> > > > > > > the
> > > > > > > > >> same
> > > > > > > > >> > partition)
> > > > > > > > >> >
> > > > > > > > >>
> > > > > > > > >> I argue that the current behaviour of OffsetRequest
with
> > > regards
> > > > > to
> > > > > > > > >> duplicate partitions is a bug
> > > > > > > > >> and think it would be a mistake to move the
same semantics
> > > over
> > > > to
> > > > > > > thew
> > > > > > > > >> new
> > > > > > > > >> ListOffset API.
> > > > > > > > >> One use case is that an application may want
to know the
> > > offset
> > > > > > range
> > > > > > > > >> between two timestamps,
> > > > > > > > >> e.g., for reprocessing, batching, searching,
etc.
> > > > > > > > >>
> > > > > > > > >>
> > > > > > > > >> Thanks,
> > > > > > > > >> Magnus
> > > > > > > > >>
> > > > > > > > >>
> > > > > > > > >>
> > > > > > > > >> >
> > > > > > > > >> > Have a good long weekend!
> > > > > > > > >> >
> > > > > > > > >> > Thanks,
> > > > > > > > >> >
> > > > > > > > >> > Jiangjie (Becket) Qin
> > > > > > > > >> >
> > > > > > > > >> >
> > > > > > > > >> >
> > > > > > > > >> >
> > > > > > > > >> > On Fri, Sep 2, 2016 at 6:10 PM, Ismael
Juma <
> > > > ismael@juma.me.uk>
> > > > > > > > wrote:
> > > > > > > > >> >
> > > > > > > > >> > > Thanks for the proposal Becket.
Looks good overall, a
> > few
> > > > > > > comments:
> > > > > > > > >> > >
> > > > > > > > >> > > ListOffsetResponse => [TopicName
[PartitionOffsets]]
> > > > > > > > >> > > >   PartitionOffsets => Partition
ErrorCode Timestamp
> > > > [Offset]
> > > > > > > > >> > > >   Partition => int32
> > > > > > > > >> > > >   ErrorCode => int16
> > > > > > > > >> > > >   Timestamp => int64
> > > > > > > > >> > > >   Offset => int
> > > > > > > > >> > >
> > > > > > > > >> > >
> > > > > > > > >> > > It should be int64 for `Offset`
right?
> > > > > > > > >> > >
> > > > > > > > >> > > Implementation wise, we will migrate
to
> > > > o.a.k.common.requests.
> > > > > > > > >> > > ListOffsetRequest
> > > > > > > > >> > > > class on the broker side.
> > > > > > > > >> > >
> > > > > > > > >> > >
> > > > > > > > >> > > Could you clarify what you mean
here? We already
> > > > > > > > >> > > use o.a.k.common.requests.ListOffsetRequest
in
> > KafkaApis.
> > > > > > > > >> > >
> > > > > > > > >> > > long offset = consumer.offsetForTime(
> > > > > Collections.singletonMap(
> > > > > > > > >> > > topicPartition,
> > > > > > > > >> > > > targetTime)).offset;
> > > > > > > > >> > >
> > > > > > > > >> > >
> > > > > > > > >> > > The result of `offsetForTime` is
a Map, so we can't
> just
> > > > call
> > > > > > > > >> `offset` on
> > > > > > > > >> > > it. You probably meant something
like:
> > > > > > > > >> > >
> > > > > > > > >> > > long offset = consumer.offsetForTime(
> > > > > Collections.singletonMap(
> > > > > > > > >> > > topicPartition,
> > > > > > > > >> > > targetTime)).get(topicPartition).offset;
> > > > > > > > >> > >
> > > > > > > > >> > > Test searchByTimestamp with CreateTime
and
> LogAppendTime
> > > > > > > > >> > > >
> > > > > > > > >> > >
> > > > > > > > >> > > Do you mean `Test offsetForTime`?
> > > > > > > > >> > >
> > > > > > > > >> > > And:
> > > > > > > > >> > >
> > > > > > > > >> > > 1. In KAFKA-1588, the following
issue was described
> > "When
> > > > > > > performing
> > > > > > > > >> an
> > > > > > > > >> > > OffsetRequest, if you request the
same topic and
> > partition
> > > > > > > > combination
> > > > > > > > >> > in a
> > > > > > > > >> > > single request more than once (for
example, if you
> want
> > to
> > > > get
> > > > > > > both
> > > > > > > > >> the
> > > > > > > > >> > > head and tail offsets for a partition
in the same
> > > request),
> > > > > you
> > > > > > > will
> > > > > > > > >> get
> > > > > > > > >> > a
> > > > > > > > >> > > response for both, but they will
be the same offset".
> > Will
> > > > the
> > > > > > new
> > > > > > > > >> > request
> > > > > > > > >> > > version support the use case where
multiple timestamps
> > are
> > > > > > passed
> > > > > > > > for
> > > > > > > > >> the
> > > > > > > > >> > > same topic partition? And if we
do support it at the
> > > > protocol
> > > > > > > level,
> > > > > > > > >> do
> > > > > > > > >> > we
> > > > > > > > >> > > also want to support it at the
API level or do we
> think
> > > the
> > > > > > > > additional
> > > > > > > > >> > > complexity is not worth it?
> > > > > > > > >> > >
> > > > > > > > >> > > 2. Is `offsetForTime` the right
method name given that
> > we
> > > > are
> > > > > > > > getting
> > > > > > > > >> > > multiple offsets? Maybe it should
be `offsetsForTimes`
> > or
> > > > > > > something
> > > > > > > > >> like
> > > > > > > > >> > > that.
> > > > > > > > >> > >
> > > > > > > > >> > > Ismael
> > > > > > > > >> > >
> > > > > > > > >> > > On Wed, Aug 31, 2016 at 4:38 AM,
Becket Qin <
> > > > > > becket.qin@gmail.com
> > > > > > > >
> > > > > > > > >> > wrote:
> > > > > > > > >> > >
> > > > > > > > >> > > > Hi Kafka devs,
> > > > > > > > >> > > >
> > > > > > > > >> > > > I created KIP-79 to allow
consumer to precisely
> query
> > > the
> > > > > > > offsets
> > > > > > > > >> based
> > > > > > > > >> > > on
> > > > > > > > >> > > > timestamp.
> > > > > > > > >> > > >
> > > > > > > > >> > > > In short we propose to :
> > > > > > > > >> > > > 1. add a ListOffsetRequest/ListOffsetResponse
v1,
> and
> > > > > > > > >> > > > 2. add an offsetForTime()
method in new consumer.
> > > > > > > > >> > > >
> > > > > > > > >> > > > The KIP wiki is the following:
> > > > > > > > >> > > > https://cwiki.apache.org/confluence/pages/viewpage.
> > > > > > > > >> > > action?pageId=65868090
> > > > > > > > >> > > >
> > > > > > > > >> > > > Comments are welcome.
> > > > > > > > >> > > >
> > > > > > > > >> > > > Thanks,
> > > > > > > > >> > > >
> > > > > > > > >> > > > Jiangjie (Becket) Qin
> > > > > > > > >> > > >
> > > > > > > > >> > >
> > > > > > > > >> >
> > > > > > > > >>
> > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message