kafka-jira mailing list archives

From "Peter Davis (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (KAFKA-5285) Optimize upper / lower byte range for key range scan on windowed stores
Date Tue, 20 Feb 2018 02:33:00 GMT

    [ https://issues.apache.org/jira/browse/KAFKA-5285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16369654#comment-16369654 ]

Peter Davis commented on KAFKA-5285:

Noting that after upgrading to 1.0.1 today, I'm seeing severely degraded performance
of `(ReadOnly)SessionStore.fetch(key)` as well.  Before we were only seeing the problem with
`fetch(from,to)`.  Browsed the source code and I didn't immediately see what changed between
0.11 and 1.0 there.  (Another guess is it's a subtle side effect of some other change like
perhaps https://issues.apache.org/jira/browse/KAFKA-4868 resulting in different compacted
DB levels somehow?)

Anyway, workaround for me is to use `findSessions(key, 0, System.currentTimeMillis() + <some
reasonable time in the future>)`, since the 0x00 bytes in a timestamp < Long.MAX_VALUE
yield a few extra usable bytes of maxKey prefix.
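
To make the byte argument concrete, here is a small self-contained sketch. It is not Kafka code: it assumes the timestamp is serialized as an 8-byte big-endian long (the `ByteBuffer` default), and the class and method names are made up for illustration. It only counts how many leading 0x00 bytes a given upper timestamp has.

```java
import java.nio.ByteBuffer;

public class TimestampPrefix {
    // Assumption: timestamps are stored as 8 big-endian bytes.
    public static byte[] encode(long ts) {
        return ByteBuffer.allocate(Long.BYTES).putLong(ts).array();
    }

    // Number of leading 0x00 bytes in the encoded timestamp.
    public static int leadingZeroBytes(long ts) {
        int n = 0;
        for (byte b : encode(ts)) {
            if (b != 0) {
                break;
            }
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // Hypothetical "reasonable time in the future": one day ahead.
        long oneDayAhead = System.currentTimeMillis() + 24L * 60 * 60 * 1000;
        System.out.println(leadingZeroBytes(Long.MAX_VALUE)); // 0 (first byte is 0x7F)
        System.out.println(leadingZeroBytes(oneDayAhead));    // 2 for present-day clocks
    }
}
```

A bound near the current time therefore starts with 0x00 0x00 instead of 0x7F, which is the difference the workaround relies on.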

Both `ReadOnlySessionStore.fetch(...)` variants are entirely unusable for me at this time.

> Without any additional information about the key length or the lower bound, we can
only assume that keys are at least 1 byte, and that byte has to be smaller or equal to the
first byte of keyTo (i.e. our upper bound has to start with the first byte of keyTo), so our
best guess for an upper bound in that case is ADFFF.

Doing a range query with *one byte* of prefix will never give acceptable performance for any
database with more than 8 keys(!), or in use cases where key prefixes are not randomly distributed
(common in business applications).

May I suggest a few options, not mutually exclusive, but in order of preference:

1. Optimize where fromKey and toKey are the same or have a common prefix.  (Isn't that your
minimum key length right there?  I'm not really sure I understand why it's not just this simple.
 Note, this is the only case I personally care about.)

2. Deprecate the `fetch` variants in favor of `findSessions`, and document that using max=Long.MAX_VALUE
is not recommended.  Promote findSessions to ReadOnlySessionStore.  (This at least gives a
few more bytes of usable key prefix.)

3. Configuration for default timeStartLatest = currentTimeMillis() + <reasonable offset
like 1 day>.  (Same benefit as #2)

4. Configure minimum key length.  I don't like this because if natural keys are used (user
names, human-readable business object references like "file number", etc.) then there isn't
necessarily a good minimum key length that can be enforced by the application.  And if there
were, it'd likely vary by store, raising the question of how do you easily configure per-store
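
As a rough illustration of option 1 only, the shared prefix of the two bounds can be computed as below. This is a sketch under my own assumptions, not the Kafka implementation: `CommonPrefix` and `commonPrefix` are hypothetical names, and whether the result can safely serve as the scan prefix depends on how the store lays out its keys.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class CommonPrefix {
    // Longest run of leading bytes shared by both key bounds.
    public static byte[] commonPrefix(byte[] from, byte[] to) {
        int limit = Math.min(from.length, to.length);
        int i = 0;
        while (i < limit && from[i] == to[i]) {
            i++;
        }
        return Arrays.copyOf(from, i);
    }

    public static void main(String[] args) {
        byte[] from = "user-100".getBytes(StandardCharsets.UTF_8);
        byte[] to = "user-199".getBytes(StandardCharsets.UTF_8);
        // Different bounds: only the shared part is kept.
        System.out.println(new String(commonPrefix(from, to), StandardCharsets.UTF_8)); // user-1
        // Identical bounds, as in fetch(key): the whole key is the prefix.
        System.out.println(commonPrefix(from, from).length); // 8
    }
}
```

When fromKey and toKey are equal, the prefix is the full key, which is the case I care about.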

> Optimize upper / lower byte range for key range scan on windowed stores
> -----------------------------------------------------------------------
>                 Key: KAFKA-5285
>                 URL: https://issues.apache.org/jira/browse/KAFKA-5285
>             Project: Kafka
>          Issue Type: Improvement
>          Components: streams
>            Reporter: Xavier Léauté
>            Assignee: Guozhang Wang
>            Priority: Major
>              Labels: performance
> The current implementation of {{WindowKeySchema}} / {{SessionKeySchema}} {{upperRange}}
and {{lowerRange}} does not make any assumptions with respect to the other key bound (e.g.
the upper byte bound does not depend on the lower key bound).
> It should be possible to optimize the byte range somewhat further using the information
provided by the lower bound.
> More specifically, by incorporating that information, we should be able to eliminate
the corresponding {{upperRangeFixedSize}} and {{lowerRangeFixedSize}}, since the result should
be the same if we implement that optimization.
