kafka-dev mailing list archives

From "Tom Lee (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (KAFKA-2063) Bound fetch response size
Date Mon, 12 Oct 2015 11:01:05 GMT

    https://issues.apache.org/jira/browse/KAFKA-2063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14952925#comment-14952925

Tom Lee commented on KAFKA-2063:

This sounds familiar. Am I right in thinking something similar could impact broker replica
fetchers given a large number of partitions per broker?

We've observed symptoms of behavior like this in a somewhat badly configured cluster:
almost 1k partitions per broker with a roughly 50/50 leader/follower split, replica.fetch.max.bytes
at ~20MB, and some 20 replica fetcher threads.

"Steady state" is more or less fine with an 8GB heap and the live set is generally minimal,
but we see OOMEs during startup as the broker plays "catch up" on the partitions it's following,
and heavy GC during replica reassignments. With a much larger ~20GB heap the OOMEs go away,
but GC is still *very* heavy for several minutes during startup and we see regular promotion
failures despite a huge amount of free space in old gen (implying large promotions).

From what I've seen of the code, I have a very strong suspicion a similar scenario
to what you're describing here would also occur in ReplicaFetcherThread/AbstractFetcherThread
as each thread might be pulling data back from dozens of partitions at a time. With replica.fetch.max.bytes=20MB,
all it takes is being a little behind on, say, 50 partitions and suddenly a single fetch
needs ~1GB.

Even with saner values for replica.fetch.max.bytes (say ~5MB), it doesn't get a whole lot
better given the number of partitions involved should a broker fall a little behind due to
e.g. unexpected GC pauses.
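To make the arithmetic above concrete, here's a back-of-the-envelope sketch (the function name is illustrative, not anything from the Kafka codebase):

```python
# Worst case for a single replica fetch: every lagging partition
# returns a full replica.fetch.max.bytes chunk in the same response.
def worst_case_fetch_bytes(lagging_partitions: int, fetch_max_bytes: int) -> int:
    return lagging_partitions * fetch_max_bytes

MB = 1024 ** 2

# 50 lagging partitions at replica.fetch.max.bytes=20MB -> ~1GB in one fetch
print(worst_case_fetch_bytes(50, 20 * MB) // MB)   # 1000

# even at a "saner" 5MB, 200 lagging partitions is still ~1GB
print(worst_case_fetch_bytes(200, 5 * MB) // MB)   # 1000
```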

Anyway, perhaps it's worth opening a separate ticket for this but I suspect fixing this issue
at the protocol level will let us fix the issue for replica fetchers / brokers too (if it
hasn't already been fixed for replica fetchers in 0.8.2+ by some other means).

> Bound fetch response size
> -------------------------
>                 Key: KAFKA-2063
>                 URL: https://issues.apache.org/jira/browse/KAFKA-2063
>             Project: Kafka
>          Issue Type: Improvement
>            Reporter: Jay Kreps
> Currently the only bound on the fetch response size is max.partition.fetch.bytes * num_partitions.
There are two problems:
> 1. First this bound is often large. You may choose max.partition.fetch.bytes=1MB to enable
messages of up to 1MB. However if you also need to consume 1k partitions this means you may
receive a 1GB response in the worst case!
> 2. The actual memory usage is unpredictable. Partition assignment changes, and you only
actually get the full fetch amount when you are behind and there is a full chunk of data ready.
This means an application that seems to work fine will suddenly OOM when partitions shift
or when the application falls behind.
> We need to decouple the fetch response size from the number of partitions.
> The proposal for doing this would be to add a new field to the fetch request, max_bytes
which would control the maximum data bytes we would include in the response.
> The implementation on the server side would grab data from each partition in the fetch
request until it hit this limit, then send back just the data for the partitions that fit
in the response. The implementation would need to start from a random position in the list
of topics included in the fetch request to ensure that in a case of backlog we fairly balance
between partitions (to avoid first giving just the first partition until that is exhausted,
then the next partition, etc).
> This setting will make the max.partition.fetch.bytes field in the fetch request much
less useful and we should discuss just getting rid of it.
> I believe this also solves the same thing we were trying to address in KAFKA-598. The
max_bytes setting now becomes the new limit that would need to be compared to max_message
size. This can be much larger--e.g. setting a 50MB max_bytes setting would be okay, whereas
now if you set 50MB you may need to allocate 50MB*num_partitions.
> This will require evolving the fetch request protocol version to add the new field and
we should do a KIP for it.
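For what it's worth, the server-side behavior proposed above (fill from a random starting partition until max_bytes is hit) could be sketched roughly like this. This is an illustrative sketch, not actual broker code; the function and parameter names are made up:

```python
import random

def fill_fetch_response(ready_bytes, max_bytes, start=None):
    """Sketch of the proposed bounded fetch: walk the partition list
    from a random starting point so that, under backlog, no partition is
    systematically favored, and stop once the max_bytes budget is spent.

    ready_bytes: dict mapping partition -> bytes of data ready to send.
    Returns a dict mapping partition -> bytes included in the response.
    """
    partitions = list(ready_bytes)
    if start is None:
        # random starting index gives fairness across fetch requests
        start = random.randrange(len(partitions))
    response = {}
    remaining = max_bytes
    for i in range(len(partitions)):
        p = partitions[(start + i) % len(partitions)]
        take = min(ready_bytes[p], remaining)
        if take > 0:
            response[p] = take
            remaining -= take
        if remaining == 0:
            break
    return response
```

Note the rotation: without the random start, a backlogged broker would drain the first partition in the list before the others ever see data.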

This message was sent by Atlassian JIRA
