flink-user mailing list archives

From Christian Kreutzfeldt <mnx...@gmail.com>
Subject Re: Flink, Kappa and Lambda
Date Fri, 13 Nov 2015 12:21:18 GMT

Personally, I find the concepts of the so-called Kappa architecture
intriguing. But I doubt that it is applicable in a generic setup where
different use cases are mapped onto the architecture. To be fair, I think
the same applies to Lambda architectures. Therefore I wouldn't assume that
Lambda architectures are obsolete with the advent of Kappa as a new
architectural paradigm.

From my point of view, it all depends on the use case that you want to
solve. For example, I saw a presentation given by Eric Tschetter and
Fangjin Yang of MetaMarkets on how they use Hadoop and Druid to drive their
business. They used Hadoop as long-term storage and Druid on the serving
layer to provide up-to-date data to the business by updating it in
sub-second intervals. They regularly align both systems to keep them
consistent. In their case, the Lambda architecture serves their business
quite well: speed is achieved through the streaming layer and long-term
persistence through the batch layer.

In cases where you - for example - want to create views on customer
sessions by aggregating all events belonging to a single person and use
them to

* serve recommendation systems while the customer is still on your website
* keep them persistent in a long-term archive

people tend to build typical Lambda architectures with duplicated
sessionizing code on both layers. From my point of view this is unnecessary
and introduces an additional source of errors. As customer sessions are
created as a stream of events, simply implement the logic on your streaming
layer and persist the final session after a timeout in those systems where
you need the data to be present: e.g. the recommender system receives
constant updates on each new event, and the batch layer (Hadoop) receives
the finished session after it has timed out.
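
A minimal sketch of that single-layer sessionizing, in plain Python rather
than Flink's API. The class name, the two callbacks, the event strings and
the timeout value are all hypothetical, and timestamps are assumed to arrive
in order:

```python
from dataclasses import dataclass, field


@dataclass
class Session:
    user_id: str
    events: list = field(default_factory=list)
    last_seen: float = 0.0


class Sessionizer:
    """Sessionizing logic that lives only on the streaming layer.

    on_update   is called on every event (e.g. feeds the recommender).
    on_finished is called once per session after it timed out
                (e.g. writes the session to the long-term archive).
    """

    def __init__(self, on_update, on_finished, timeout):
        self.on_update = on_update
        self.on_finished = on_finished
        self.timeout = timeout          # seconds of inactivity (assumed unit)
        self.open_sessions = {}         # user_id -> Session

    def process(self, user_id, timestamp, event):
        # Close every session that has been idle for too long, then
        # add the event to the user's (possibly new) open session.
        self.expire(timestamp)
        session = self.open_sessions.setdefault(user_id, Session(user_id))
        session.events.append(event)
        session.last_seen = timestamp
        self.on_update(session)

    def expire(self, now):
        for user_id, session in list(self.open_sessions.items()):
            if now - session.last_seen >= self.timeout:
                del self.open_sessions[user_id]
                self.on_finished(session)


# Usage: two in-memory lists stand in for the recommender and the archive.
recommender_updates = []
archive = []
sessionizer = Sessionizer(
    on_update=lambda s: recommender_updates.append((s.user_id, len(s.events))),
    on_finished=lambda s: archive.append((s.user_id, list(s.events))),
    timeout=10,
)
sessionizer.process("alice", 0, "view:home")
sessionizer.process("alice", 5, "view:product")
sessionizer.process("bob", 20, "view:home")  # alice has been idle for 15s
```

The recommender sees every intermediate state, while the archive only ever
receives complete sessions, so the sessionizing code exists exactly once. A
real streaming job would expire sessions with timers instead of waiting for
the next event to arrive.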

As Lambda - in most cases - is implemented to do the same thing on both
layers, later merging the results to keep states consistent, the Kappa
architecture introduces an interesting pattern that people are often not
aware of. The idea of persisting the stream itself and getting rid of other
systems, like RDBMS, NoSQL DBs or any other type of archive software, is
often accepted as a cheap way to reduce costs and maintenance effort.

But I think Kappa does more and may be extended to systems other than
streaming as well. You keep the data persistent in the system where it
arrived or where it reached the state you expect in subsequent systems. Why
should I convert a stream of tracking events into a static schema and store
the data inside an RDBMS? What if I rely on the fact that the data comes in
as a stream and do not want to have it exported/imported as a bulk update,
but have the same stream replayed later? What about information loss? Being
a stream of events is part of the information as well, like the attributes
each event

So, if Kappa is understood as an architectural pattern where data is kept
and processed the way it arrived or is expected by subsequent systems, I do
not think that it will ever replace Lambda, but it will complement it.
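
To illustrate keeping data the way it arrived, here is a small sketch in
which the event stream itself is the record and every view is derived by
replaying it. EventLog is a hypothetical in-memory stand-in for a persisted
stream such as a Kafka topic, and the event fields are made up for the
example:

```python
from collections import Counter


class EventLog:
    """Append-only, in-memory stand-in for a persisted event stream."""

    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(event)
        return len(self._events) - 1    # offset of the stored event

    def replay(self, from_offset=0):
        # Hand out the events again, in their original order.
        yield from self._events[from_offset:]


def page_views(events):
    """View 1: number of views per page."""
    return Counter(e["page"] for e in events if e["type"] == "view")


def views_per_user(events):
    """View 2: added later, built from the very same stream."""
    return Counter(e["user"] for e in events if e["type"] == "view")


log = EventLog()
log.append({"user": "alice", "type": "view", "page": "/home"})
log.append({"user": "bob", "type": "view", "page": "/home"})
log.append({"user": "alice", "type": "click", "page": "/home"})
log.append({"user": "alice", "type": "view", "page": "/cart"})

by_page = page_views(log.replay())
by_user = views_per_user(log.replay())
```

Neither view required a schema to be fixed up front or a bulk export/import.
The order of the events is preserved as well, which a static table would not
guarantee by itself.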

Therefore I would advise you to look at your use case(s) and design the
architecture as you need it. Do not stick to a certain pattern, but deploy
those parts that fit your use case. This field is far too young for
strictly following a certain pattern to provide you with additional value,
e.g. to make it easier to integrate with third-party


2015-11-13 9:51 GMT+01:00 Welly Tambunan <if05041@gmail.com>:

> Hi rss rss,
> Yes. I have already read that book.
> However, given the state of streaming right now and the Kappa
> Architecture, I don't think we need the Lambda Architecture anymore?
> Any thoughts?
> On Thu, Nov 12, 2015 at 12:29 PM, rss rss <rssdev10@gmail.com> wrote:
>> Hello,
>>   regarding the Lambda architecture, there is the following book:
>> https://www.manning.com/books/big-data (Big Data: Principles and best
>> practices of scalable realtime data systems, by Nathan Marz and James
>> Warren).
>> Regards,
>> Roman
>> 2015-11-12 4:47 GMT+03:00 Welly Tambunan <if05041@gmail.com>:
>>> Hi Stephan,
>>> Thanks for your response.
>>> We are trying to decide whether it's enough to use the Kappa
>>> Architecture with Flink. This is more about resiliency, message loss,
>>> etc. The article worries about message loss even if you are using Kafka:
>>> No matter the message queue or broker you rely on whether it be
>>> RabbitMQ, JMS, ActiveMQ, Websphere, MSMQ and yes even Kafka you can lose
>>> messages in any of the following ways:
>>>    - A downstream system from the broker can have data loss
>>>    - All message queues today can lose already acknowledged messages
>>>    during failover or leader election.
>>>    - A bug can send the wrong messages to the wrong systems.
>>> Cheers
>>> On Wed, Nov 11, 2015 at 4:13 PM, Stephan Ewen <sewen@apache.org> wrote:
>>>> Hi!
>>>> Can you explain a little more what you want to achieve? Maybe then we
>>>> can give a few more comments...
>>>> I briefly read through some of the articles you linked, but did not
>>>> quite understand their train of thought.
>>>> For example, letting Tomcat write to Cassandra directly, and to Kafka,
>>>> might just be redundant. Why not let the streaming job that reads the
>>>> Kafka queue move the data to Cassandra as one of its results?
>>>> Furthermore, durably storing the sequence of events is exactly what
>>>> Kafka does, but the article suggests using Cassandra for that, which I
>>>> find very counterintuitive. It looks a bit like the suggested approach
>>>> is only adopting streaming for half the task.
>>>> Greetings,
>>>> Stephan
>>>> On Tue, Nov 10, 2015 at 7:49 AM, Welly Tambunan <if05041@gmail.com>
>>>> wrote:
>>>>> Hi All,
>>>>> I read a couple of articles about the Kappa and Lambda Architectures.
>>>>> http://www.confluent.io/blog/real-time-stream-processing-the-next-step-for-apache-flink/
>>>>> I'm convinced that Flink will simplify this with streaming.
>>>>> However, I also stumbled upon this blog post, which makes a valid
>>>>> argument for having a system of record storage (event sourcing), and
>>>>> in the end a Lambda architecture appears as the solution. Basically it
>>>>> writes twice, to the queuing system and to C*, for safety. The system
>>>>> of record here basically stores the events (deltas).
>>>>> [image: Inline image 1]
>>>>> https://lostechies.com/ryansvihla/2015/09/17/event-sourcing-and-system-of-record-sane-distributed-development-in-the-modern-era-2/
>>>>> Another approach is about lambda architecture for maintaining the
>>>>> correctness of the system.
>>>>> https://lostechies.com/ryansvihla/2015/09/17/real-time-analytics-with-spark-streaming-and-cassandra/
>>>>> Given that he's using Spark as the stream processor, do we have to
>>>>> do the same thing with Apache Flink?
>>>>> Cheers
>>>>> --
>>>>> Welly Tambunan
>>>>> Triplelands
>>>>> http://weltam.wordpress.com
>>>>> http://www.triplelands.com <http://www.triplelands.com/blog/>
>>> --
>>> Welly Tambunan
>>> Triplelands
>>> http://weltam.wordpress.com
>>> http://www.triplelands.com <http://www.triplelands.com/blog/>
> --
> Welly Tambunan
> Triplelands
> http://weltam.wordpress.com
> http://www.triplelands.com <http://www.triplelands.com/blog/>
