samza-dev mailing list archives

From Jay Kreps <jay.kr...@gmail.com>
Subject Re: How does Samza overcome the topics # limitations of Kafka
Date Fri, 23 Aug 2013 15:25:54 GMT
Oh, I think perhaps that documentation is a bit confusing. The member id
would be used for partition selection, but not every member id would be a
partition. For example, if you had four partitions you could partition by
hash(key) % 4.
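The hashing scheme Jay describes can be sketched as follows (the member-id key and the partition count of 4 are illustrative, not from any Samza API):

```python
# Sketch of key-based partition selection: many distinct member ids
# map onto a small, fixed number of partitions, so the number of keys
# is unrelated to the number of partitions.

NUM_PARTITIONS = 4  # illustrative; matches the four-partition example above


def select_partition(member_id: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Return the partition index for a given key via hash(key) % N."""
    # All messages with the same member_id land in the same partition,
    # so per-key ordering is preserved within that partition.
    return hash(member_id) % num_partitions
```

Within one process the mapping is deterministic, so a million member ids still produce at most four partitions.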

The partition count essentially bounds the parallelism of the downstream
processing (i.e. you can't have more containers, the physical processes,
than you have tasks). Formally
  # tasks = max(# upstream partitions) >= # containers

Our observation is that stream jobs don't require massive parallelism in
the way that, for example, Hadoop jobs do, though they often process the
same data. This is because they run continuously and pipeline processing.

In MapReduce, if you have a daily job that processes that day's worth of
data, it blocks all uses of its output until it completes. As a result you
end up in a situation where you want to process 24 hours of data in some
set of MapReduce jobs, and a particular stage may need to complete pretty
quickly, say in 10 minutes. Obviously this requires processing at 24*60/10 =
144 times the speed of data acquisition. So you need a sudden burst of many
tasks to finish this up as quickly as possible. Cases where MapReduce
processing isn't incremental at all (i.e. all data is reprocessed) are even
more extreme. A stream processing task generally only needs to keep up,
which is 144 times less demanding. The exception, of course, is that if you
anticipate periods of downtime you will need to be able to catch up at a
reasonable rate.
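The catch-up arithmetic above is worth making explicit (the 10-minute stage budget is the example figure from the paragraph, not a fixed rule):

```python
# Required processing speedup for a daily batch job whose stage must
# finish within a fixed time budget, versus a streaming job that only
# needs to keep pace with the incoming data rate.

DAY_MINUTES = 24 * 60          # one day's worth of data, in minutes
STAGE_BUDGET_MINUTES = 10      # time allowed for the batch stage

# Batch: must replay a full day of data inside the budget window.
batch_speedup = DAY_MINUTES / STAGE_BUDGET_MINUTES

# Streaming: processes data as it arrives, so 1x is enough in steady state.
streaming_speedup = 1

print(batch_speedup)  # 144.0
```

This is why the batch job needs a burst of massive parallelism while the stream job, pipelining continuously, does not.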

Currently in Kafka the primary downside of a high partition count is longer
fail-over time. This is a big problem for cases where Kafka is taking live
requests that block a website. But for stream processing a 1-second
failover is usually fine. This makes partition counts in the ballpark of
100k feasible (but we haven't gone there).

Longer term we do think even that will be an issue and the plan is to just
work on scaling Kafka's ability to handle high partition counts gracefully.

Cheers,

-Jay


On Fri, Aug 23, 2013 at 3:57 AM, Stone <stones.gao@gmail.com> wrote:

> Hello,
>
> As explained in the following docs:
>
>
> http://samza.incubator.apache.org/learn/documentation/0.7.0/introduction/architecture.html
>
> The input topic is partitioned using Kafka. Each Samza process reads
> messages from one or more of the input topic's partitions, and emits them
> back out to a different Kafka topic keyed by the message's member ID
> attribute.
>
> In the example above, the task will create many topics keyed by "message's
> member ID attribute"; if there are millions of intermediate keys, how does
> Samza handle the topic limitations of Kafka? (Ref
> http://grokbase.com/t/kafka/users/133v60ng6v/limit-on-number-of-kafka-topic
>  )
>
>
> Best Regards,
> Stone
>
