From: Chris Riccomini
To: "users@kafka.apache.org", "dev@kafka.apache.org"
Subject: Re: New Consumer API discussion
Date: Mon, 3 Mar 2014 18:19:03 +0000

Hey Guys,

Sorry for the late follow up. Here are my questions/thoughts on the API:

1. Why is the config String->Object instead of String->String?

2. Are these Java docs correct?

     KafkaConsumer(java.util.Map configs)

   "A consumer is instantiated by providing a set of key-value pairs as
   configuration and a ConsumerRebalanceCallback implementation."

   There is no ConsumerRebalanceCallback parameter.

3. Would like to have a method:

     poll(long timeout, java.util.concurrent.TimeUnit timeUnit,
          TopicPartition... topicAndPartitionsToPoll)

   I see I can effectively do this by just fiddling with subscribe and
   unsubscribe before each poll. Is this a low-overhead operation? Can I
   just unsubscribe from everything after each poll, then re-subscribe to
   a topic the next iteration? I would probably be doing this in a fairly
   tight loop. (I've sketched the pattern I mean below, after these
   questions.)
4. The behavior of AUTO_OFFSET_RESET_CONFIG is overloaded. I think there
   are use cases for decoupling "what to do when no offset exists" from
   "what to do when I'm out of range". I might want to start from
   smallest the first time I run, but fail if I ever get an offset out of
   range.

5. ENABLE_JMX could use Java docs, even though it's fairly
   self-explanatory.

6. It would be useful to clarify whether FETCH_BUFFER_CONFIG is per
   topic/partition, or across all topic/partitions. I believe it's per
   topic/partition, right? That is, setting it to 2 megs with two
   TopicAndPartitions would result in 4 megs worth of data coming in per
   fetch, right?

7. What does the consumer do if METADATA_FETCH_TIMEOUT_CONFIG times out?
   Retry, or throw an exception?

8. Does RECONNECT_BACKOFF_MS_CONFIG apply to both metadata requests and
   fetch requests?

9. What does SESSION_TIMEOUT_MS default to?

10. Is this consumer thread-safe?

11. How do you use a different offset management strategy? Your email
    implies that it's pluggable, but I don't see how. "The offset
    management strategy defaults to Kafka based offset management and
    the API provides a way for the user to use a customized offset store
    to manage the consumer's offsets."

12. If I wish to decouple the consumer from the offset checkpointing, is
    it OK to use Joel's offset management stuff directly, rather than
    through the consumer's commit API?
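Here's the pattern from 3 as a sketch, just so we're talking about the
same thing. I'm assuming subscribe/unsubscribe take TopicPartition
varargs as in the draft javadoc, and guessing at poll()'s signature and
return type; nextPartition() and process() are placeholders of mine:

    // Hypothetical sketch of the subscribe/unsubscribe-per-poll pattern
    // that a poll(timeout, unit, partitions...) method would replace.
    KafkaConsumer consumer = new KafkaConsumer(configs);
    while (true) {
        TopicPartition tp = nextPartition();  // placeholder: choose this iteration's partition
        consumer.subscribe(tp);               // is this cheap enough for a tight loop?
        Map<String, ConsumerRecords> records = consumer.poll(100);  // timeout in ms (guess)
        process(records);                     // placeholder: handle the returned records
        consumer.unsubscribe(tp);             // drop everything again before the next iteration
    }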
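And for 12, roughly what I have in mind. OffsetStore is just a stand-in
for Joel's offset management work (none of its method names are real),
and I'm still guessing at the consumer-side signatures:

    // Sketch: checkpoint offsets to an external store and never call the
    // consumer's commit API.
    KafkaConsumer consumer = new KafkaConsumer(configs);  // auto-commit disabled
    consumer.subscribe(new TopicPartition("my-topic", 0));
    OffsetStore store = new OffsetStore();                // hypothetical external offset store
    while (true) {
        Map<String, ConsumerRecords> records = consumer.poll(100);
        long lastOffset = process(records);               // placeholder: returns last offset consumed
        store.save("my-topic", 0, lastOffset);            // checkpoint outside the consumer
        // consumer.commit() is deliberately never called
    }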
Cheers,
Chris

On 2/10/14 10:54 AM, "Neha Narkhede" wrote:

>As mentioned in previous emails, we are also working on a
>re-implementation of the consumer. I would like to use this email thread
>to discuss the details of the public API. I would also like us to be
>picky about this public API now so it is as good as possible and we
>don't need to break it in the future.
>
>The best way to get a feel for the API is actually to take a look at the
>javadoc <.../kafka/clients/consumer/KafkaConsumer.html>; the hope is to
>get the API docs good enough so that it is self-explanatory. You can
>also take a look at the configs here
><.../kafka/clients/consumer/ConsumerConfig.html>.
>
>Some background info on implementation:
>
>At a high level the primary difference in this consumer is that it
>removes the distinction between the "high-level" and "low-level"
>consumer. The new consumer API is non-blocking: instead of returning a
>blocking iterator, the consumer provides a poll() API that returns a
>list of records. We think this is better than the blocking iterators
>since it effectively decouples the threading strategy used for
>processing messages from the consumer. It is worth noting that the
>consumer is entirely single-threaded and runs in the user thread. The
>advantage is that it can be easily rewritten in less
>multi-threading-friendly languages. The consumer batches data and
>multiplexes I/O over the TCP connections to each of the brokers it
>communicates with, for high throughput. The consumer also allows long
>poll to reduce the end-to-end message latency for low-throughput data.
>
>The consumer provides a group management facility that supports the
>concept of a group with multiple consumer instances (just like the
>current consumer). This is done through a custom heartbeat and group
>management protocol transparent to the user. At the same time, it allows
>users the option to subscribe to a fixed set of partitions and not use
>group management at all. The offset management strategy defaults to
>Kafka based offset management and the API provides a way for the user to
>use a customized offset store to manage the consumer's offsets.
>
>A key difference in this consumer also is the fact that it does not
>depend on ZooKeeper at all.
>
>More details about the new consumer design are here
><...Rewrite+Design>.
>
>Please take a look at the new API
><.../kafka/clients/consumer/KafkaConsumer.html> and give us any thoughts
>you may have.
>
>Thanks,
>Neha
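A minimal consumption loop against the API described above might look
like the following sketch. Class and method names follow the draft
javadoc linked in the thread; poll()'s return type and the process()
helper are assumptions:

    // Group-managed subscription: subscribing by topic name lets the
    // heartbeat/group management protocol assign partitions across the
    // consumer instances sharing a group.
    Map<String, Object> configs = new HashMap<String, Object>();
    configs.put("group.id", "my-group");
    KafkaConsumer consumer = new KafkaConsumer(configs);
    consumer.subscribe("my-topic");
    // Or bypass group management with a fixed set of partitions:
    //   consumer.subscribe(new TopicPartition("my-topic", 0));
    while (true) {
        // Non-blocking style: poll() returns whatever records are ready
        // rather than handing back a blocking iterator, leaving the
        // processing threading model entirely to the caller.
        Map<String, ConsumerRecords> records = consumer.poll(1000);
        process(records);  // placeholder: application-specific handling
    }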