samza-dev mailing list archives

From Felix GV <fville...@linkedin.com.INVALID>
Subject RE: How do you serve the data computed by Samza?
Date Wed, 01 Apr 2015 21:02:33 GMT
Hi Shekar,

I think the main question regarding your use case is the following: do you need read-your-writes
consistency?

If the answer is yes, then I'm afraid Samza is the wrong system for you. All processing in
Samza is bound to be asynchronous, so even if you tune your systems for super aggressive
end-to-end latency, it will still be hard to guarantee that whatever you write to Kafka to
be processed by Samza will be available for reading shortly after. For example, when Samza
deals with failed containers and re-bootstraps the local state in a new container, that will
certainly introduce delays in processing.

If the answer is no, then Samza may be a great choice for you, but that means you need to
accept that the feedback loop is asynchronous and introduces some delay. This, in turn,
means the data you read on the other end will be (at least a little) stale. For a lot of use
cases, a certain period of staleness in the data is perfectly fine (and hopefully this period
of time is bounded), but if you need to do synchronous writes, meaning the result of your
transformation is available for reading as soon as your write operation returns, then I think
Kafka and Samza in their current states are probably not the right choice.
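The tradeoff above usually pushes staleness handling to the reader. As a hypothetical sketch (the in-memory map stands in for the external k/v store that Samza populates asynchronously; names and timings are made up), a web service that accepts bounded staleness might poll with a timeout:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class BoundedPollingReader {
    // Stand-in for the external k/v store that Samza populates asynchronously.
    static final Map<String, String> kvStore = new ConcurrentHashMap<>();

    // Poll the store until the key appears or the deadline passes.
    // Returns null on timeout, i.e. the caller must tolerate staleness.
    static String readWithTimeout(String key, long timeoutMs) {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (System.currentTimeMillis() < deadline) {
            String value = kvStore.get(key);
            if (value != null) {
                return value;
            }
            try {
                Thread.sleep(10); // back off briefly between polls
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return kvStore.get(key); // last attempt; may still be null
    }

    public static void main(String[] args) {
        // Simulate Samza writing the enriched value a little later.
        new Thread(() -> {
            try { Thread.sleep(50); } catch (InterruptedException ignored) {}
            kvStore.put("user-42", "enriched");
        }).start();

        System.out.println(readWithTimeout("user-42", 500)); // prints "enriched"
        System.out.println(readWithTimeout("missing", 100)); // prints "null"
    }
}
```

The timeout bounds how long a request waits for the asynchronous pipeline; a real service would pick it based on the pipeline's observed end-to-end latency.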

Does that make sense?

--

Felix GV
Data Infrastructure Engineer
Distributed Data Systems
LinkedIn

fgv@linkedin.com
linkedin.com/in/felixgv

________________________________________
From: Shekar Tippur [ctippur@gmail.com]
Sent: Wednesday, April 01, 2015 12:49 PM
To: dev@samza.apache.org
Subject: Re: How do you serve the data computed by Samza?

Felix,

Since the webservices call is not maintained in a single session, how do we
tie the incoming request to the enriched kv pair?

For example, let's take a use case: a given user (partitioned by user id)
needs to be presented with all the different regions he/she has been in
within a given time range.

Let's say that this involves 2 Samza jobs to populate the kv store
(associated with a small time lag). Should the webservices layer keep
querying the kv store till the value is populated? Maybe I am not
thinking right?

- Shekar

On Wed, Apr 1, 2015 at 10:37 AM, Felix GV <fvillegas@linkedin.com.invalid>
wrote:

> Hi Vladimir,
>
> It seems like you have decided on an architecture which is fairly similar
> to the one I suggested (:
>
> Out of curiosity, I have the following questions regarding your system:
>
>   *   Are you saying the Samza tasks co-located with your k/v store nodes
> are also doing processing, and not just ingesting? And if they are doing
> processing:
>      *   Would you mind sharing what kind of processing they do? (i.e.:
> joining streams, counting stuff, etc.?)
>      *   Does the processing they do require local state? And does that
> local state and resulting IO compete with the resources of the co-located
> k/v store?
>   *   Would you mind also sharing which external k/v store you are
> ingesting into?
>
> Shekar,
>
>
> I'm not 100% sure I understand your question... what do you mean the
> producer has no control over the consumer? Partitioning can allow you to
> decide which specific consumer will get a given key, and the same
> partitioning strategy can also be used by the web service to query the
> correct k/v store shard/node, right?
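The shared partitioning strategy described above can be sketched roughly like this (the modulo hash is a simplified stand-in for Kafka's default murmur2-based partitioner; the class and names are illustrative):

```java
public class PartitionRouter {
    // The same function is used on both the write path (choosing the Kafka
    // partition) and the read path (choosing the k/v store shard to query).
    static int partitionFor(String key, int numPartitions) {
        // Simplified stand-in; Kafka's default partitioner hashes the key bytes
        // with murmur2. The only requirement is that both sides agree.
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        int partitions = 8;
        String userId = "user-42";
        // Producer side: the record for userId goes to this partition...
        int writePartition = partitionFor(userId, partitions);
        // ...web service side: query the shard holding the same partition.
        int readPartition = partitionFor(userId, partitions);
        System.out.println(writePartition == readPartition); // prints "true"
    }
}
```

As long as both sides use the identical key and partition count, a read for a given key always lands on the shard that the Samza task for that partition populates.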
>
>
> --
>
> Felix GV
> Data Infrastructure Engineer
> Distributed Data Systems
> LinkedIn
>
> fgv@linkedin.com
> linkedin.com/in/felixgv
>
> ________________________________________
> From: Shekar Tippur [ctippur@gmail.com]
> Sent: Wednesday, April 01, 2015 6:54 AM
> To: dev@samza.apache.org
> Subject: Re: How do you serve the data computed by Samza?
>
> I am still not fully sure how this would pan out, since at each stage the
> producer sends an event and has no control over the consumer.
>
> {Web services call} -> {samza enrichment1} -> {samza enrichment2}
>                              ||                     ||
>                              V                      V
>                        {kv store1}             {kv store2}
>
> In this case, how do we tie back the webservices call to data in kv store
> 2?
>
> - Shekar
>
> On Wed, Apr 1, 2015 at 1:49 AM, Vladimir Lebedev <wal@fastmail.fm> wrote:
>
> > Dear Harlan and Felix,
> >
> > Many thanks for your input! In my particular case I decided to use an
> > external KV-store sharded by the same key as the partition key in the
> > Kafka topic. KV-store shards will be colocated with the Samza tasks
> > processing the corresponding topic partitions in order to minimize
> > latency.
> >
> > Best regards,
> > Vladimir
> >
> > --
> > Vladimir Lebedev
> > http://linkedin.com/in/vlebedev
> >
> >
> > On 03/31/2015 07:39 PM, Felix GV wrote:
> >
> >> Hi Harlan and Vladimir,
> >>
> >> I think the idea of serving data directly off of Samza has been
> >> mentioned a few times, but there are certain caveats that make this a
> >> risky proposition. For example:
> >>
> >> * Samza does not have the same uptime constraints as a dedicated
> >> data serving platform. While I'm a big fan of the fault-tolerant
> >> state provided by Samza (especially when compared to Storm and
> >> Spark Streaming), we have to realize that the Samza model
> >> effectively is to have one instance (or container) up and running
> >> for each partition, and to regenerate the state of that instance
> >> elsewhere if the first one goes down. This means that while the
> >> fault-tolerance is automatic, it is not instantaneous; it may take
> >> a minute or more, depending on the size of the state to
> >> regenerate. This is okay for the vast majority of nearline stream
> >> processing use cases, but not so much for a serving platform,
> >> which typically expects more redundancy. This is related to
> >> https://issues.apache.org/jira/browse/SAMZA-406, though even with
> >> those changes, it's not clear whether the fail-over would be
> >> considered fast enough for online data serving needs.
> >> * Samza is meant to support spiky workloads without too many
> >> problems, whether it is because of an actual spike in input
> >> traffic, or because of a re-processing use case (i.e.: the point
> >> #5 in the OP which I also clarified in my last email). A data
> >> serving system needs to be much more wary of spikes, because it
> >> typically needs to maintain good p99 latency. Therefore, Samza and
> >> the serving system may have very different JVM tunings (if they
> >> even run on the JVM at all). Even if RocksDB were used in both
> >> Samza and the serving system, it might benefit from being tuned
> >> differently (i.e.: one for write throughput and the other for read
> >> performance).
> >>
> >>
> >> --
> >>
> >> Felix GV
> >> Data Infrastructure Engineer
> >> Distributed Data Systems
> >> LinkedIn
> >>
> >> fgv@linkedin.com
> >> linkedin.com/in/felixgv
> >>
> >> ________________________________________
> >> From: Harlan Iverson [harlan@pubnub.com]
> >> Sent: Saturday, March 28, 2015 5:30 PM
> >> To: dev@samza.apache.org
> >> Subject: Re: How do you serve the data computed by Samza?
> >>
> >> Felix/Shekar,
> >>
> >> Given that Samza itself uses RocksDB to create a materialized view of a
> >> partition from earliest to latest offset, I imagine that to be a good
> >> choice to begin evaluating. Two parts:
> >>
> >> 1. A method recommended in this article/video
> >> <http://blog.confluent.io/2015/03/04/turning-the-database-inside-out-with-apache-samza/>
> >> seems to suggest building a full in-process materialized view in each
> >> consuming process. This increases performance at the cost of space, so
> >> if that is an acceptable tradeoff then it should be golden.
> >>
> >> One approach may be writing to a Samza topic-backed key-value (KV)
> >> store in one upstream task and consuming it from one or more
> >> downstream KV stores on the same topic (e.g. a task to look up derived
> >> info by sub_key in Jordan's example above, and consume it later,
> >> keying the messages by sub_key). The reasoning behind this is that the
> >> KV-store backing is simply a Kafka topic partition that leverages the
> >> compaction model (keep at least the last message for a given key), and
> >> then sequentially populates the KV-store in each consuming process
> >> upon process creation and keeps it updated during operation. I believe
> >> a caveat is that any consuming tasks would then have to use the
> >> KV-store as read-only.
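As a rough sketch of what such a topic-backed store looks like in job configuration (the store name and serde are illustrative; the keys follow Samza's documented store configuration):

```properties
# Local KV store materialized from (and logged to) a Kafka changelog topic,
# which can be configured for compaction so only the latest value per key is kept.
stores.my-store.factory=org.apache.samza.storage.kv.RocksDbKeyValueStorageEngineFactory
stores.my-store.changelog=kafka.my-store-changelog
stores.my-store.key.serde=string
stores.my-store.msg.serde=string
serializers.registry.string.class=org.apache.samza.serializers.StringSerdeFactory
```

With this in place, writes to the store are also written to the changelog topic, and a restarted or downstream consumer can rebuild the store by replaying that topic from the beginning.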
> >>
> >> Given the single-threaded model of Samza and partition-level ordering
> >> of Kafka, I think that if a) all topics in the pipeline have the same
> >> number of partitions, b) messages are published with the same key at
> >> each step, and c) Kafka leader write acking is used, then a store
> >> value should always be committed before the downstream task consumes
> >> it, though I'm not sure how to guarantee that it is actually consumed
> >> first downstream, given that consumers may be slightly behind. Is
> >> simply relying on "behind high watermark" = 0 on the KV consumer topic
> >> reliable enough to ensure this? If not, could the latest Kafka offset
> >> of the KV stream be tagged onto the message upstream and then held
> >> downstream until the consumer's KV-store reaches that offset?
> >>
> >> 2. As for how to then serve it, would it be a bad idea to embed a
> >> REST/HTTP server in a Samza task itself (effectively one per
> >> container/task instance) and put them behind a dynamically updated
> >> load balancer, updating mappings when the containers are
> >> launched/destroyed?
> >>
> >> One could also directly materialize their own RocksDB or similar in
> >> language/platform of choice by following the same protocol of feeding a
> >> Kafka topic from oldest offset into an in-process KV store and serving
> >> against it, creating a materialized view in the same fashion.
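That replay-into-a-local-view protocol can be sketched independently of any particular store. The class below is a hypothetical illustration of the last-write-wins semantics a compacted topic gives you (the Kafka consumer loop that would feed apply() from the oldest offset is omitted; a null value mirrors a compaction tombstone):

```java
import java.util.HashMap;
import java.util.Map;

public class MaterializedView {
    private final Map<String, String> view = new HashMap<>();

    // Apply one record from the changelog, oldest to newest.
    // Last write wins; a null value is treated as a delete tombstone.
    public void apply(String key, String value) {
        if (value == null) {
            view.remove(key);
        } else {
            view.put(key, value);
        }
    }

    public String get(String key) {
        return view.get(key);
    }

    public int size() {
        return view.size();
    }

    public static void main(String[] args) {
        MaterializedView mv = new MaterializedView();
        // Replaying a partition: later records for a key supersede earlier ones.
        mv.apply("user-42", "region=EU");
        mv.apply("user-42", "region=US");
        mv.apply("user-7", "region=APAC");
        mv.apply("user-7", null); // tombstone deletes the key
        System.out.println(mv.get("user-42")); // prints "region=US"
        System.out.println(mv.size()); // prints "1"
    }
}
```

Because each record is applied in partition order, replaying the topic from the oldest offset always converges to the same view, regardless of when the consumer started.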
> >>
> >> I imagine that a (forthcoming?) transaction model in Kafka
> >> <https://cwiki.apache.org/confluence/display/KAFKA/Transactional+Messaging+in+Kafka>
> >> could provide a consistent view across all consumers, but until then
> >> there may need to be tolerance for them to be slightly out of sync.
> >>
> >> --
> >>
> >> We've not implemented any of this, but it's the approach that I'd
> >> first think to take. To start with the task KV-store route, there are
> >> some scripts included with Samza to test KV-store performance that
> >> would give some concrete r/w performance numbers, and this could
> >> probably be further explored by implementing the KV-store interfaces
> >> for any given JVM storage engine/client.
> >>
> >> Cheers
> >>
> >>
> >> On Fri, Mar 27, 2015 at 7:06 PM, Shekar Tippur <ctippur@gmail.com>
> >> wrote:
> >>
> >> > Felix/Jordan,
> >> >
> >> > 1 - 2 is exactly what I was looking for as well. I want to expose a
> >> > webservices call to Kafka/Samza. As there is no concept of a
> >> > session, I was wondering how to send back enriched data to the web
> >> > services request. Or am I way off on this? Meaning, is this a
> >> > completely wrong use case for Kafka/Samza?
> >> >
> >> > - Shekar
> >> >
> >> > On Fri, Mar 27, 2015 at 12:42 PM, Jordan Shaw <jordan@pubnub.com>
> >> > wrote:
> >> >
> >> > > Felix,
> >> > > Here are my thoughts below
> >> > >
> >> > > 1 - 2) I think a majority of Samza applications are internal so
> >> > > far. However, I've developed a Samza Publisher for PubNub that
> >> > > would allow you to send data from process or window out over a
> >> > > Data Stream Network. Right now it looks something like this:
> >> > >
> >> > > (.send collector (OutgoingMessageEnvelope. (SystemStream.
> >> > > "pubnub.some-channel") {:pub_key demo :sub_key demo} some-data))
> >> > >
> >> > > At smaller scale you could do the same with socket.io etc. If
> >> > > you're interested in this I can send you the src or jar. If there
> >> > > is wider interest I can open source it on github, but it needs
> >> > > some cleanup first.
> >> > >
> >> > > 3) We currently don't have the need to warehouse our stream, but
> >> > > we have thought about piping Samza-generated data into some
> >> > > Hadoop-based system for longer-term analysis, then running Hive
> >> > > queries or something similar over that data.
> >> > >
> >> > > 4) I can't comment on the throughput of the other systems (HBase
> >> > > etc.) but our Kafka/Samza throughput is pretty impressive
> >> > > considering the single-threaded nature of the system. We are
> >> > > seeing raw throughput per partition of well over 10 MB/s.
> >> > >
> >> > > 5) I haven't run into this. To prevent data loss/backup when we
> >> > > can't process a message, we have considered dropping it into an
> >> > > "unprocessed" topic, but we haven't really run into this need. If
> >> > > you needed to reprocess all raw data it would be pretty
> >> > > straightforward: you could just add a partition to support the
> >> > > extra load.
> >> > >
> >> > > 6) Kafka is pretty good at ingesting things, so could you
> >> > > elaborate more on this?
> >> > >
> >> > > On Fri, Mar 27, 2015 at 9:52 AM, Felix GV
> >> > > <fvillegas@linkedin.com.invalid> wrote:
> >> > >
> >> > > > Hi Samza devs, users and enthusiasts,
> >> > > >
> >> > > > I've kept an eye on the Samza project for a while and I think
> >> > > > it's super cool! I hope it continues to mature and expand as it
> >> > > > seems very promising (:
> >> > > >
> >> > > > One thing I've been wondering for a while is: how do people
> >> > > > serve the data they computed on Samza? More specifically:
> >> > > >
> >> > > > 1. How do you expose the output of Samza jobs to online
> >> > > > applications that need low-latency reads?
> >> > > > 2. Are these online apps mostly internal (i.e.: analytics,
> >> > > > dashboards, etc.) or public/user-facing?
> >> > > > 3. What systems do you currently use (or plan to use in the
> >> > > > short-term) to host the data generated in Samza? HBase?
> >> > > > Cassandra? MySQL? Druid? Others?
> >> > > > 4. Are you satisfied or are you facing challenges in terms of
> >> > > > the write throughput supported by these storage/serving
> >> > > > systems? What about read throughput?
> >> > > > 5. Are there situations where you wish to re-process all
> >> > > > historical data when making improvements to your Samza job,
> >> > > > which results in the need to re-ingest all of the Samza output
> >> > > > into your online serving system (as described in the Kappa
> >> > > > Architecture
> >> > > > <http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html>)?
> >> > > > Is this easy breezy or painful? Do you need to throttle it lest
> >> > > > your serving system fall over?
> >> > > > 6. If there were a highly-optimized and reliable way of
> >> > > > ingesting partitioned streams quickly into your online serving
> >> > > > system, would that help you leverage Samza more effectively?
> >> > > >
> >> > > > Your insights would be much appreciated!
> >> > > >
> >> > > >
> >> > > > Thanks (:
> >> > > >
> >> > > >
> >> > > > --
> >> > > > Felix
> >> > > >
> >> > >
> >> > >
> >> > >
> >> > > --
> >> > > Jordan Shaw
> >> > > Full Stack Software Engineer
> >> > > PubNub Inc
> >> > > 1045 17th St
> >> > > San Francisco, CA 94107
> >> > >
> >> >
> >>
> >
> >
>
