samza-dev mailing list archives

From Stone <stones....@gmail.com>
Subject Re: How does Samza overcome the topic # limitations of Kafka
Date Sat, 24 Aug 2013 03:27:51 GMT
Thanks for the clarification.

BTW: Another question:
http://samza.incubator.apache.org/learn/documentation/0.7.0/comparisons/introduction.html

In the State section of the Samza intro, it says that Samza tasks can create
and restore state from local storage (leveldb), but how does Samza ensure
that the local state and the task that created it always end up on the same
machine? For instance:

Task A for some topic's partition #0 first runs on machine A, where it
creates local state SA. When a failure or restart occurs, how does the
scheduler ensure that the tuple (task, topic partition #, state for task)
stays bundled together?
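One way to make restored state independent of machine placement, and roughly the approach Samza takes, is to back each task's local store with a durable per-task changelog in Kafka, so the store can be rebuilt wherever the task is rescheduled. A minimal sketch (the class and method names here are hypothetical, not the Samza API):

```python
# Hypothetical sketch: each task's local store is write-ahead logged to a
# per-task changelog partition, so state can be rebuilt on any machine by
# replaying that partition after a failure or restart.

class TaskState:
    def __init__(self, changelog):
        self.changelog = changelog  # durable log partition for this task
        self.store = {}             # local key-value store (leveldb in Samza)

    def put(self, key, value):
        self.changelog.append((key, value))  # log first, then apply locally
        self.store[key] = value

    @classmethod
    def restore(cls, changelog):
        # A new container on any machine replays the changelog in order
        # to rebuild the same local store the failed task had.
        state = cls(changelog)
        for key, value in changelog:
            state.store[key] = value
        return state

# Task for partition #0 runs on machine A and builds state SA...
log = []
sa = TaskState(log)
sa.put("member-42", 3)
# ...then the task restarts on machine B and restores SA from the log.
restored = TaskState.restore(list(log))
```

The bundling is therefore by task identity, not by machine: the (task, partition, changelog) association is fixed, and the local store is just a replayable cache of the changelog.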


On Sat, Aug 24, 2013 at 7:16 AM, Chris Riccomini <criccomini@linkedin.com> wrote:

> Hey Guys,
>
> I took a shot at updating the docs:
>
>
> http://samza.incubator.apache.org/learn/documentation/0.7.0/introduction/architecture.html
>
> The page now reads:
>
> "The input topic is partitioned using Kafka. Each Samza process reads
> messages from one or more of the input topic's partitions, and emits them
> back out to a different Kafka topic. Each output message is keyed by the
> message's member ID attribute, and this key is mapped to one of the
> topic's partitions (usually by hashing the key, and modding by the number
> of partitions in the topic). The Kafka brokers receive these messages, and
> buffer them on disk until the second job (the counting job on the bottom
> of the diagram) reads the messages, and increments its counters."
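The key-to-partition mapping the updated docs describe (hash the key, mod by the partition count) can be sketched as follows. This is illustrative only; a real Kafka producer uses its own configured partitioner, and crc32 here merely stands in for whatever hash it applies:

```python
# Illustrative sketch of the routing described above: each output message
# is assigned a partition by hashing its member ID and modding by the
# number of partitions, so all messages for one member land together.
import zlib

NUM_PARTITIONS = 4  # assumed topic size for the example

def partition_for(member_id: str) -> int:
    # crc32 stands in for the producer's real hash function
    return zlib.crc32(member_id.encode("utf-8")) % NUM_PARTITIONS

# The mapping is deterministic: the same member ID always maps to the
# same partition, so one downstream task sees all of that member's
# messages, no matter how many distinct keys exist.
p = partition_for("member-42")
assert p == partition_for("member-42")
assert 0 <= p < NUM_PARTITIONS
```

This is also why millions of distinct keys do not require millions of topics: the key space folds down onto a fixed number of partitions of a single topic.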
>
>
> Cheers,
> Chris
>
> On 8/23/13 8:25 AM, "Jay Kreps" <jay.kreps@gmail.com> wrote:
>
> >Oh, I think perhaps that documentation is a bit confusing. The member id
> >would be used for partition selection but not every member id would be a
> >partition. For example, if you had four partitions, you could partition by
> >hash(key) % 4.
> >
> >The partition count essentially bounds the parallelism of the downstream
> >processing (i.e. you can't have more containers, the physical processes,
> >than you have tasks). Formally
> >  # containers <= # tasks = max(# upstream partitions)
> >
> >Our observation is that stream jobs don't require massive parallelism in
> >the way that, for example, Hadoop jobs do, though they often process the
> >same data. This is because they run continuously and pipeline processing.
> >
> >In MapReduce if you have a daily job that processes that day's worth of
> >data, it blocks all uses of its output until it completes. As a result you
> >end up in a situation where you want to process 24 hours of data in some
> >set of mapreduce jobs, and a particular stage may need to complete pretty
> >quickly, say in 10 minutes. Obviously this requires processing at
> >24*60/10 = 144 times the speed of data acquisition. So you need a sudden
> >burst of many tasks to finish this up as quickly as possible. Cases where
> >mapreduce processing isn't incremental at all (i.e. all data is
> >reprocessed) are even more extreme. A stream processing task generally
> >only needs to keep up, which is 144 times less demanding. The exception,
> >of course, is if you anticipate periods of downtime, in which case you
> >will need to be able to catch up at a reasonable rate.
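The batch-vs-stream arithmetic above can be checked directly:

```python
# A daily batch stage must compress a day's worth of input into its
# deadline, so it needs a burst of throughput; a stream job only has to
# match the data's arrival rate (a speedup factor of 1).
minutes_of_data = 24 * 60      # one day's worth of input
stage_deadline_minutes = 10    # how fast the batch stage must finish

required_speedup = minutes_of_data / stage_deadline_minutes
assert required_speedup == 144  # the 144x figure from the email
```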
> >
> >Currently in Kafka the primary downside of high partition count is longer
> >fail-over time. This is a big problem for cases where Kafka is taking live
> >requests that block a website. But for stream processing a 1 second
> >failover is usually fine. This makes partition counts in the ballpark of
> >100k feasible (but we haven't gone there).
> >
> >Longer term we do think even that will be an issue and the plan is to just
> >work on scaling Kafka's ability to handle high partition counts
> >gracefully.
> >
> >Cheers,
> >
> >-Jay
> >
> >
> >On Fri, Aug 23, 2013 at 3:57 AM, Stone <stones.gao@gmail.com> wrote:
> >
> >> Hello,
> >>
> >> As explained in the following docs:
> >>
> >> http://samza.incubator.apache.org/learn/documentation/0.7.0/introduction/architecture.html
> >>
> >> The input topic is partitioned using Kafka. Each Samza process reads
> >> messages from one or more of the input topic's partitions, and emits
> >>them
> >> back out to a different Kafka topic keyed by the message's member ID
> >> attribute.
> >>
> >> In the example above, the task will create many topics keyed by
> >> "message's member ID attribute". If there are millions of intermediate
> >> keys, how does Samza handle the topic limitations of Kafka? (Ref
> >> http://grokbase.com/t/kafka/users/133v60ng6v/limit-on-number-of-kafka-topic )
> >>
> >>
> >> Best Regards,
> >> Stone
> >>
>
>
