spark-user mailing list archives

From Krzysztof Zarzycki <>
Subject Re: Is it feasible to keep millions of keys in state of Spark Streaming job for two months?
Date Wed, 15 Apr 2015 05:36:51 GMT
Thank you Tathagata, very helpful answer.

Though, I would like to highlight that recent stream processing systems are
trying to help users implement the use case of holding such large state
(like 2 months of data). I would mention here Samza state management and
Trident state management
<>. I'm waiting for
Spark to help with that too, because I would generally prefer to stay with
Spark.
But considering holding state in Cassandra with Spark Streaming, I
understand we're not talking here about using Cassandra as input or output
(nor about using the spark-cassandra-connector
<>). We're talking
here about querying Cassandra from map/mapPartitions functions.
I have one question about it: is it possible to query Cassandra
asynchronously within Spark Streaming? And while doing so, is it possible
to fetch the next batch of rows while the previous one is waiting on
Cassandra I/O? I think (but I'm not sure) this generally asks whether
several consecutive windows can interleave (because they take long to
process). Let's draw it:

<------|query Cassandra asynchronously--- > window1
        <---------------------------------------> window2

While writing this, I'm starting to believe they can, because windows are
time-triggered, not triggered when the previous window has finished... But
it's better to ask :)
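For illustration, the asynchronous-query idea could be sketched roughly like
this (a sketch only, under assumptions: the DataStax Java driver's
`Session#executeAsync`, a hypothetical per-executor `CassandraSessionHolder`
singleton, an `Event` case class with a `user` field, and made-up
keyspace/table names — none of these are from this thread):

```scala
// Sketch: fire async Cassandra queries for a whole partition, then collect,
// so the I/O for different rows overlaps instead of blocking one by one.
import com.datastax.driver.core.{ResultSet, ResultSetFuture, Session}

eventStream.mapPartitions { events =>
  val session: Session = CassandraSessionHolder.get() // hypothetical per-JVM singleton
  // First fire all queries for this partition without blocking per element...
  val inFlight: List[(Event, ResultSetFuture)] = events.map { e =>
    (e, session.executeAsync(
      "SELECT state FROM ks.user_state WHERE user = ?", e.user))
  }.toList
  // ...then block for the results, by which time many are already done.
  inFlight.iterator.map { case (e, f) =>
    (e, f.getUninterruptibly: ResultSet)
  }
}
```

This overlaps I/O within one partition; whether whole windows interleave is a
separate scheduling question, as discussed above.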

2015-04-15 2:08 GMT+02:00 Tathagata Das <>:

> Fundamentally, stream processing systems are designed for processing
> streams of data, not for storing large volumes of data for a long period of
> time. So if you have to maintain that much state for months, then it's best
> to use another system that is designed for long-term storage (like
> Cassandra), which has proper support for making all that state
> fault-tolerant, highly performant, etc. So yes, the best option is to use
> Cassandra for the state, with Spark Streaming jobs accessing the state from
> Cassandra. There are a number of optimizations that can be done. It's not
> too hard to build a simple on-demand populated cache (a singleton hash map,
> for example) that speeds up access from Cassandra, with all updates
> written through the cache. This is a common use of Spark Streaming +
> Cassandra/HBase.
> Regarding the performance of updateStateByKey, we are aware of the
> limitations, and we will improve it soon :)
> TD
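The on-demand populated, write-through cache TD describes could be sketched
like this (a minimal sketch; the Cassandra read/write calls are placeholders,
not real connector API, and the state type mirrors the per-user event lists
from the question):

```scala
// Sketch of a JVM-wide singleton cache in front of Cassandra: reads populate
// it lazily on a miss, and all updates are written through to Cassandra.
import java.util.concurrent.ConcurrentHashMap

object StateCache {
  private val cache = new ConcurrentHashMap[String, List[(String, String)]]()

  def get(user: String): List[(String, String)] = {
    val cached = cache.get(user)
    if (cached != null) cached
    else {
      val fromDb = readFromCassandra(user) // populate on demand
      cache.putIfAbsent(user, fromDb)
      fromDb
    }
  }

  def update(user: String, events: List[(String, String)]): Unit = {
    cache.put(user, events)        // update the in-memory copy...
    writeToCassandra(user, events) // ...and write through to Cassandra
  }

  // Placeholders: real code would use a Cassandra session here.
  private def readFromCassandra(user: String): List[(String, String)] = ???
  private def writeToCassandra(user: String, events: List[(String, String)]): Unit = ???
}
```

Being a singleton `object`, one instance lives per executor JVM, so hot keys
are served from memory across batches on the same executor.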
> On Tue, Apr 14, 2015 at 12:34 PM, Krzysztof Zarzycki <
> > wrote:
>> Hey guys, could you please help me with a question I asked on
>> Stackoverflow:
>> ?  I'll be really grateful for your help!
>> I'm also pasting the question below:
>> I'm trying to solve a (simplified here) problem in Spark Streaming: Let's
>> say I have a log of events made by users, where each event is a tuple (user
>> name, activity, time), e.g.:
>> ("user1", "view", "2015-04-14T21:04Z") ("user1", "click",
>> "2015-04-14T21:05Z")
>> Now I would like to gather events by user to do some analysis of that.
>> Let's say that output is some analysis of:
>> ("user1", List(("view", "2015-04-14T21:04Z"), ("click",
>> "2015-04-14T21:05Z")))
>> The events should be kept for as long as *2 months*. During that time there
>> might be around *500 million* such events, and *millions of unique* users,
>> which are the keys here.
>> *My questions are:*
>>    - Is it feasible to do such a thing with updateStateByKey on DStream,
>>    when I have millions of keys stored?
>>    - Am I right that DStream.window is of no use here, when I have a
>>    2-month window and would like a slide of a few seconds?
>> P.S. I found out that updateStateByKey is called on all the keys on
>> every slide, so that means it will be called millions of times every few
>> seconds. That makes me doubt this design, and I'm rather thinking about
>> alternative solutions like:
>>    - using Cassandra for state
>>    - using Trident state (with Cassandra probably)
>>    - using Samza with its state management.
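For reference, the updateStateByKey approach questioned above would look
roughly like this (a sketch; `events` is assumed to be a
`DStream[(String, (String, String))]` of (user, (activity, time)) pairs):

```scala
// Sketch: accumulate each user's events in state. Note the update function
// runs over every key on every batch, which is exactly why millions of keys
// make this expensive at a slide of a few seconds.
val perUserEvents = events.updateStateByKey[List[(String, String)]] {
  (newEvents: Seq[(String, String)], state: Option[List[(String, String)]]) =>
    Some(state.getOrElse(Nil) ++ newEvents)
}
```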
