From: Peter Lin
To: "user@cassandra.apache.org"
Date: Thu, 18 Dec 2014 09:11:09 -0500
Subject: Re: Cassandra for Analytics?

In the interest of knowledge sharing on the general topic of stream
processing: the domain is quite old and there's a lot of existing
literature. Within this space there are several important factors which
many products don't address:

temporal windows (sliding windows, discrete windows, dynamic windows) - most support the first two, but handle dynamic windows poorly
temporal validity (for how long is the data valid?) - most don't support this
temporal patterns (patterns that are valid for a finite amount of time) - most don't support this as a first-class concept
temporal data types (machine learning systems that can create new data types) - most don't support this
temporal distance (the maximum time-to-live for a specific piece of data) - most don't support this

Having studied many stream processing products, I've found that most focus
on simple queries on one tuple (aka object type) and basic joining of
streams. A tuple here is basically equivalent to one table. Some stream
products let you materialize views (aka projections) like summary tables,
but most do not let you define an in-memory cube to make complex queries
easier. For the most part, the developer has to mentally break down the
queries into multiple pieces and do it manually.

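To make the window terminology concrete, here is a minimal Python sketch
contrasting discrete (tumbling) windows with a time-based sliding window.
It is not taken from any of the products discussed in this thread. The
event shape, the names `tumbling_sums` and `SlidingSum`, and the 10-second
window size are all assumptions made for illustration.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple

# Hypothetical event shape: (timestamp in seconds, value).
Event = Tuple[float, float]


def tumbling_sums(events: List[Event], size: float) -> Dict[int, float]:
    """Discrete (tumbling) windows: each event lands in exactly one bucket."""
    sums: Dict[int, float] = {}
    for ts, value in events:
        bucket = int(ts // size)  # bucket 0 covers [0, size), bucket 1 covers [size, 2*size), ...
        sums[bucket] = sums.get(bucket, 0.0) + value
    return sums


class SlidingSum:
    """Time-based sliding window: sum of the values seen in the last `size` seconds."""

    def __init__(self, size: float) -> None:
        self.size = size
        self.window: Deque[Event] = deque()
        self.total = 0.0

    def add(self, ts: float, value: float) -> float:
        """Add one event (timestamps must be non-decreasing) and return the current sum."""
        self.window.append((ts, value))
        self.total += value
        # Evict events that have slid out of the window (ts - size, ts].
        while self.window and self.window[0][0] <= ts - self.size:
            _, old_value = self.window.popleft()
            self.total -= old_value
        return self.total


events = [(0.0, 1.0), (4.0, 2.0), (9.0, 3.0), (11.0, 4.0), (16.0, 5.0)]

print(tumbling_sums(events, 10.0))             # {0: 6.0, 1: 9.0}
sliding = SlidingSum(10.0)
print([sliding.add(ts, v) for ts, v in events])  # [1.0, 3.0, 6.0, 9.0, 12.0]
```

With discrete windows each event is counted once, in its own bucket, and a
result is available only when the bucket closes. With a sliding window the
result is updated on every event and old events age out continuously.
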
With most products, it's possible to hack together something that looks
like an MDX query, but the level of effort differs. Even then, the bigger
question is the overall architecture. Once the use case is known, it's
much easier to decide what needs to be filtered before persistence and
what needs to be summarized before persistence.

peter

On Thu, Dec 18, 2014 at 8:51 AM, Ryan Svihla wrote:
>
> My mistake on Storm, and I'm certain there are a number of use cases
> where you're right that Spark isn't the right answer, but I'd argue
> you're treating it like Spark 0.5 feature-set-wise instead of Spark 1.1.
>
> As for filtering before persistence: this is the common use case for
> Spark Streaming, and I've helped a number of enterprise customers do
> this very thing (fraud detection using windows of various sizes, live
> aggregation of data, and joins), typically pulling from a Kafka topic,
> but it can be adapted to pretty much any source.
>
> I'd argue you were correct about everything at one time, but you're
> saying it can't do things it's been doing in production for a while now.
>
> On Thu, Dec 18, 2014 at 7:30 AM, Peter Lin wrote:
>>
>> For the record, I think Spark is good and I'm glad we have options.
>>
>> My point wasn't to bad-mouth Spark. I'm not comparing Spark to Storm at
>> all, so I think there's some confusion here. I'm thinking of Esper,
>> StreamBase, and other stream processing products. My point is to think
>> about the problems that need to be solved before picking a solution.
>> Like everyone else, I've been guilty of picking the tool first in the
>> past, so it's not propaganda for or against any specific product.
>>
>> I've seen customers use IBM InfoSphere Streams when something like
>> Storm or Spark would work, but I've also seen cases where open source
>> doesn't provide equivalent functionality. If Spark meets the needs, then
>> either HBase or Cassandra will probably work fine. The bigger question
>> is what patterns do you use in the architecture?
>> Do you store the data first before doing analysis? Is the data noisy
>> and in need of filtering before persistence? What kinds of
>> patterns/queries and operations are needed?
>>
>> Having worked on trading systems and other real-time use cases, I can
>> say that not all stream processing is the same.
>>
>> On Thu, Dec 18, 2014 at 8:18 AM, Ryan Svihla wrote:
>>>
>>> I'll decline to continue the commentary on Spark, as again this
>>> probably belongs on another list, other than to say that micro-batching
>>> is an intentional design tradeoff that has notable benefits for the
>>> same use cases you're referring to, and that while you may disagree
>>> with those tradeoffs, it's a bit harsh to dismiss as "basic" something
>>> that was chosen and provides some improvements over, say, the Storm
>>> model.
>>>
>>> On Thu, Dec 18, 2014 at 7:13 AM, Peter Lin wrote:
>>>>
>>>> One of the most common types of use cases in stream processing is
>>>> sliding windows based on time or count. Based on my understanding of
>>>> the Spark architecture and Spark Streaming, it does not provide the
>>>> same functionality. One can fake it by setting Spark Streaming to
>>>> really small micro-batches, but that's not the same.
>>>>
>>>> If the use case fits that model, then using Spark is fine. For other
>>>> kinds of use cases, Spark may not be a good fit. Some people store all
>>>> events before analyzing them, which works for some use cases. For
>>>> other use cases, like trading systems, store-before-analysis isn't
>>>> feasible or practical. Other use cases, like command and control, also
>>>> don't fit the store-before-analysis model.
>>>>
>>>> Try to avoid putting the cart in front of the horse.
>>>> Picking a tool before you have a clear understanding of the problem
>>>> is a good recipe for disaster.
>>>>
>>>> On Thu, Dec 18, 2014 at 8:04 AM, Ryan Svihla wrote:
>>>>>
>>>>> Since Ajay is already using Spark, the Spark Cassandra Connector
>>>>> really gets them where they want to be pretty easily (joins, etc.):
>>>>> https://github.com/datastax/spark-cassandra-connector
>>>>>
>>>>> As far as Spark Streaming having "basic support", I'd challenge that
>>>>> assertion (namely, Storm has a number of problems with delivery
>>>>> guarantees that Spark basically solves); however, this isn't a Spark
>>>>> mailing list, and perhaps this conversation is better had there.
>>>>>
>>>>> If the question is "Is Cassandra used in real-time analytics cases
>>>>> with Spark?", the answer is absolutely yes (and with Storm, for that
>>>>> matter). If the question is "Can you do your analytics queries on
>>>>> Cassandra while you have Spark sitting there doing nothing?", then of
>>>>> course the answer is no, but that'd be a bizarre question; they
>>>>> already have Spark in use.
>>>>>
>>>>> On Thu, Dec 18, 2014 at 6:52 AM, Peter Lin wrote:
>>>>>>
>>>>>> That depends on what you mean by real-time analytics.
>>>>>>
>>>>>> For things like continuous data streams, neither is an appropriate
>>>>>> platform for doing analytics. They're good for storing the results
>>>>>> (aka output) of the streaming analytics. I would suggest that before
>>>>>> you decide between Cassandra and HBase, you first figure out exactly
>>>>>> what kind of analytics you need to do. Start with prototyping and
>>>>>> look at what kind of queries and patterns you need to support.
>>>>>>
>>>>>> Neither HBase nor Cassandra is good for complex patterns that do
>>>>>> joins or cross joins (aka MDX), so using either one you have to
>>>>>> reinvent stuff.
>>>>>>
>>>>>> Most of the event processing and stream processing products out
>>>>>> there also don't support joins or cross joins very well, so any
>>>>>> solution is going to need several different components. Typically,
>>>>>> stream processing does filtering, which feeds another system that
>>>>>> does simple joins. The output of the second step can then go to
>>>>>> another system that does MDX-style queries.
>>>>>>
>>>>>> Spark Streaming has basic support, but it's not as mature and
>>>>>> feature-rich as other stream processing products.
>>>>>>
>>>>>> On Wed, Dec 17, 2014 at 11:20 PM, Ajay wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> Can Cassandra be used for, or is it a good fit for, real-time
>>>>>>> analytics? I went through a couple of benchmarks comparing Cassandra
>>>>>>> and HBase (most of them done 3 years ago), and they mentioned that
>>>>>>> Cassandra is designed for intensive writes and has higher read
>>>>>>> latency than HBase. In our case, we will have both writes and reads,
>>>>>>> with more reads (say 40% writes and 60% reads). We are planning to
>>>>>>> use Spark as the in-memory computation engine.
>>>>>>>
>>>>>>> Thanks
>>>>>>> Ajay
>>>>>
>>>>> --
>>>>> Ryan Svihla
>>>>> Solution Architect
>>>>>
>>>>> DataStax is the fastest, most scalable distributed database
>>>>> technology, delivering Apache Cassandra to the world’s most innovative
>>>>> enterprises. DataStax is built to be agile, always-on, and predictably
>>>>> scalable to any size. With more than 500 customers in 45 countries,
>>>>> DataStax is the database technology and transactional backbone of
>>>>> choice for the world's most innovative companies such as Netflix,
>>>>> Adobe, Intuit, and eBay.