samza-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Chris Riccomini <>
Subject Re: How does Samza overcome the topics # limititions of Kafka
Date Fri, 23 Aug 2013 23:16:23 GMT
Hey Guys,

I took a shot at updating the docs:

The page now reads:

"The input topic is partitioned using Kafka. Each Samza process reads
messages from one or more of the input topic's partitions, and emits them
back out to a different Kafka topic. Each output message is keyed by the
message's member ID attribute, and this key is mapped to one of the
topic's partitions (usually by hashing the key, and modding by the number
of partitions in the topic). The Kafka brokers receive these messages, and
buffer them on disk until the second job (the counting job on the bottom
of the diagram) reads the messages, and increments its counters."


On 8/23/13 8:25 AM, "Jay Kreps" <> wrote:

>Oh, I think perhaps that documentation is a bit confusing. The member id
>would be used for partition selection but not every member id would be a
>partition. For example if you had four partitions you could partition by
>hash(key) % 4.
>The partition count essentially bounds the parallelism of the downstream
>processing (i.e. you cant have more containers, the physical processes,
>then you have tasks). Formally
>  max(# upstream partitions) = # tasks < # containers
>Our observation is that stream jobs don't require massive parallelism in
>the way that, for example, Hadoop jobs do, though they often process the
>same data. This is because they run continuously and pipeline processing.
>In MapReduce if you have a daily job that processes that day's worth of
>data it blocks all uses of its output until it completes. As a result you
>end up in a situation where you want to process 24 hours of data in some
>set of mapreduce jobs and a particular stage may need to complete pretty
>quickly, say in, 10 minutes. Obvious this requires processing at 24*60/10
>144 times the speed of data acquisition. So you need a sudden burst of
>tasks to finish this up as quick as possible. Cases where mapreduce
>processing isn't incremental at all (i.e. all data is reprocessed) are
>more extreme. A stream processing task generally only needs to keep up,
>which is 144 times less demanding. The exception, of course, is if you
>anticipate periods of downtime you will need to be able to catch up at a
>reasonable rate.
>Currently in Kafka the primary downside of high partition count is longer
>fail-over time. This is a big problem for cases where Kafka is taking live
>requests that block a website. But for stream processing a 1 second
>failover is usually fine. This makes partition counts in the ballpark of
>100k feasible (but we haven't gone there).
>Longer term we do think even that will be an issue and the plan is to just
>work on scaling Kafka's ability to handle high partition counts
>On Fri, Aug 23, 2013 at 3:57 AM, Stone <> wrote:
>> Hello,
>> As explained in the following docs:
>> The input topic is partitioned using Kafka. Each Samza process reads
>> messages from one or more of the input topic's partitions, and emits
>> back out to a different Kafka topic keyed by the message's member ID
>> attribute.
>> In the example above, the task will created many topics keyed by
>> member ID attribute", if there's millions of intermediate keys, how does
>> Samza handle the topic limitations of Kafka? (Ref
>>  )
>> Best Regards,
>> Stone

View raw message