cassandra-user mailing list archives

From Peter Lin <wool...@gmail.com>
Subject Re: Cassandra for Analytics?
Date Thu, 18 Dec 2014 14:31:45 GMT
by data warehouse, what kind do you mean?

is it the traditional warehouse where people create multi-dimensional cubes?
or is it the newer class of UI tools that makes it easier for users to
explore data and the warehouse is "mostly" a denormalized (i.e. flattened)
format of the OLTP?
or is it a combination of both?

from my experience, the biggest challenge of data warehousing isn't storing
the data. It's making it easy to explore with ad hoc mdx-like queries. In the
old days, the DBAs would define the cubes, write the ETL routines and let
the data load for days/weeks. In the new nosql model, you can avoid the
cube + ETL phase, but discovering the data and understanding the format
still requires a developer.

getting the data into a "user friendly" format like a cube with Spark
still requires a developer. I find that business users hate to go to the
developer, because we tend to ask "what are the functional specs?" Most of
the time business users don't know, they just want to explore. At that
point, the storage engine largely doesn't matter to the end user. It
matters to the developers, but business users don't care.
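
To make "a cube" concrete, here is a minimal sketch in plain Python rather than Spark. It is an illustration only: the records, the field names (region, category, amount) and the build_cube helper are all made up for this example, and a real warehouse would compute this on a cluster. It pre-aggregates one measure over every combination of dimensions, which is the work someone has to define before a business user can explore anything.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical flattened (denormalized) order records; field names are illustrative.
ORDERS = [
    {"region": "EU", "category": "books", "amount": 20.0},
    {"region": "EU", "category": "toys", "amount": 35.0},
    {"region": "US", "category": "books", "amount": 15.0},
    {"region": "US", "category": "books", "amount": 10.0},
]


def build_cube(rows, dimensions, measure):
    """Sum `measure` for every subset of `dimensions` (a tiny in-memory cube)."""
    cube = defaultdict(float)
    for size in range(len(dimensions) + 1):
        for dims in combinations(dimensions, size):
            for row in rows:
                # The key records which dimensions are fixed and to what values.
                key = tuple((d, row[d]) for d in dims)
                cube[key] += row[measure]
    return dict(cube)


cube = build_cube(ORDERS, ("region", "category"), "amount")
print(cube[()])                                         # grand total
print(cube[(("region", "EU"),)])                        # roll-up by region
print(cube[(("region", "US"), ("category", "books"))])  # one cell
```

Every extra dimension doubles the number of groupings, which is one reason the number of aggregated views can grow quickly.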

based on the description, I would watch out for how many aggregated views
the platform creates. search the mailing list to see past discussions on
the maximum recommended number of column families.

where the classic data warehouse caused lots of pain is in creating cubes.
Any general solution attempting to replace/supplement existing products
needs to make it trivial to define ad hoc cubes and then query against
them. There are existing products that already connect to a few nosql
databases for data exploration. hope that helps

peter



On Thu, Dec 18, 2014 at 9:01 AM, Ajay <ajay.garga@gmail.com> wrote:
>
> Thanks Ryan and Peter for the suggestions.
>
> Our requirement (an ecommerce company) at a higher level is to build a
> Datawarehouse as a platform or service (for different product teams to
> consume) as below:
>
> Datawarehouse as a platform/service
>                      |
>             Spark SQL
>                      |
> Spark in-memory computation engine (We were considering Drill/Flink but
> Spark is more mature and in production)
>                      |
>         Cassandra/HBase (Yet to be decided. Aggregated views + data
> directly written to this. So 40-50% writes, 50-60% reads)
>                      |
>         Stream processing (Spark Streaming or Storm. Yet to be decided.
> Spark Streaming is relatively new)
>                      |
>          MySQL/Mongo/Real Time data
>
> Since we are planning to build it as a service, we cannot assume a
> particular data access pattern.
>
> Thanks
> Ajay
>
>
> On Thu, Dec 18, 2014 at 7:00 PM, Peter Lin <woolfel@gmail.com> wrote:
>>
>>
>> for the record I think spark is good and I'm glad we have options.
>>
>> my point wasn't to bad mouth spark. I'm not comparing spark to storm at
>> all, so I think there's some confusion here. I'm thinking of Esper,
>> StreamBase, and other stream processing products. My point is to think
>> about the problems that need to be solved before picking a solution. Like
>> everyone else, I've been guilty of this in the past, so it's not propaganda
>> for or against any specific product.
>>
>> I've seen customers use IBM infosphere streams when something like storm
>> or spark would work, but I've also seen cases where open source doesn't
>> provide equivalent functionality. If spark meets the needs, then either
>> hbase or cassandra will probably work fine. The bigger question is what
>> patterns do you use in the architecture? Do you store the data first before
>> doing analysis? Is the data noisy and needs filtering before persistence?
>> What kinds of patterns/queries and operations are needed?
>>
>> having worked on trading systems and other real-time use cases, I can say
>> that not all stream processing is the same.
>>
>> On Thu, Dec 18, 2014 at 8:18 AM, Ryan Svihla <rsvihla@datastax.com>
>> wrote:
>>>
>>> I'll decline to continue the commentary on spark, as again this probably
>>> belongs on another list, other than to say that micro-batching is an
>>> intentional design tradeoff that has notable benefits for the same use
>>> cases you're referring to, and that while you may disagree with those
>>> tradeoffs, it's a bit harsh to dismiss as "basic" something that was
>>> chosen and provides some improvements over, say, the Storm model.
>>>
>>> On Thu, Dec 18, 2014 at 7:13 AM, Peter Lin <woolfel@gmail.com> wrote:
>>>>
>>>>
>>>> some of the most common use cases in stream processing are sliding
>>>> windows based on time or count. Based on my understanding of spark
>>>> architecture and spark streaming, it does not provide the same
>>>> functionality. One can fake it by setting spark streaming to really small
>>>> micro-batches, but that's not the same.
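
For readers unfamiliar with the two window types mentioned above, here is a minimal sketch in plain Python of windows that are re-evaluated on every incoming event. It is an illustration only, not how any particular product implements them: the class names CountWindow and TimeWindow are hypothetical, and the time window assumes events arrive in timestamp order.

```python
from collections import deque


class CountWindow:
    """Keeps the last `size` values and reports their running average."""

    def __init__(self, size):
        # A bounded deque drops the oldest value automatically when full.
        self.values = deque(maxlen=size)

    def add(self, value):
        self.values.append(value)
        return sum(self.values) / len(self.values)


class TimeWindow:
    """Keeps values newer than `span` seconds relative to the latest event."""

    def __init__(self, span):
        self.span = span
        self.events = deque()

    def add(self, timestamp, value):
        self.events.append((timestamp, value))
        # Evict events that have slid out of the window (assumes ordered input).
        while self.events[0][0] <= timestamp - self.span:
            self.events.popleft()
        return sum(v for _, v in self.events) / len(self.events)


count_window = CountWindow(3)
count_avgs = [count_window.add(price) for price in [10, 20, 30, 40]]
print(count_avgs)

time_window = TimeWindow(5)
time_avgs = [time_window.add(ts, price)
             for ts, price in [(0, 10), (2, 20), (6, 30), (7, 40)]]
print(time_avgs)
```

The point of the sketch is that the window slides on each event, so a result is available as soon as the event arrives instead of at the end of a batch interval.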
>>>>
>>>> if the use case fits that model, then using spark is fine. For other
>>>> kinds of use cases, spark may not be a good fit. Some people store all
>>>> events before analyzing them, which works for some use cases. For other
>>>> use cases like trading systems, storing before analysis isn't feasible or
>>>> practical. Other use cases like command and control also don't fit the
>>>> store-before-analysis model.
>>>>
>>>> Try to avoid putting the cart in front of the horse. Picking a tool
>>>> before you have a clear understanding of the problem is a good recipe for
>>>> disaster.
>>>>
>>>> On Thu, Dec 18, 2014 at 8:04 AM, Ryan Svihla <rsvihla@datastax.com>
>>>> wrote:
>>>>>
>>>>> Since Ajay is already using spark, the Spark Cassandra Connector gets
>>>>> them where they want to be pretty easily (joins, etc.):
>>>>> https://github.com/datastax/spark-cassandra-connector
>>>>>
>>>>> As far as spark streaming having "basic support", I'd challenge that
>>>>> assertion (namely, Storm has a number of problems with delivery
>>>>> guarantees that Spark basically solves). However, this isn't a Spark
>>>>> mailing list, and perhaps this conversation is better had there.
>>>>>
>>>>> If the question is "Is Cassandra used in real time analytics cases with
>>>>> Spark?" the answer is absolutely yes (and Storm for that matter). If the
>>>>> question is "Can you do your analytics queries on Cassandra while you
>>>>> have Spark sitting there doing nothing?" then of course the answer is
>>>>> no, but that'd be a bizarre question, since they already have Spark in
>>>>> use.
>>>>>
>>>>> On Thu, Dec 18, 2014 at 6:52 AM, Peter Lin <woolfel@gmail.com> wrote:
>>>>>>
>>>>>> that depends on what you mean by real-time analytics.
>>>>>>
>>>>>> For things like continuous data streams, neither is an appropriate
>>>>>> platform for doing analytics. They're good for storing the results
>>>>>> (aka output) of the streaming analytics. I would suggest that before
>>>>>> you decide cassandra vs hbase, you first figure out exactly what kind
>>>>>> of analytics you need to do. Start with prototyping and look at what
>>>>>> kind of queries and patterns you need to support.
>>>>>>
>>>>>> neither hbase nor cassandra is good for complex patterns that do
>>>>>> joins or cross joins (aka mdx), so using either one you have to
>>>>>> re-invent stuff.
>>>>>>
>>>>>> most of the event processing and stream processing products out there
>>>>>> also don't support joins or cross joins very well, so any solution is
>>>>>> going to need several different components. typically stream processing
>>>>>> does filtering, which feeds another system that does simple joins. The
>>>>>> output of the second step can then go to another system that does mdx
>>>>>> style queries.
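
The three-step shape described above (filter, then simple joins, then mdx style roll-ups) can be sketched in a few lines of plain Python. This is only an illustration of the data flow: in practice each stage would be a separate system, and the events, the product lookup table and the function names here are all hypothetical.

```python
from collections import defaultdict

# Hypothetical click events and a product lookup table; all names are illustrative.
EVENTS = [
    {"product_id": 1, "price": 20.0, "valid": True},
    {"product_id": 2, "price": 35.0, "valid": True},
    {"product_id": 1, "price": 0.0, "valid": False},  # noise to be filtered out
    {"product_id": 3, "price": 15.0, "valid": True},
]
PRODUCTS = {1: "books", 2: "toys", 3: "books"}


def filter_stage(events):
    """Stage 1: drop noisy events before anything downstream sees them."""
    return (e for e in events if e["valid"])


def join_stage(events, products):
    """Stage 2: a simple lookup join that adds the product category."""
    for e in events:
        yield {**e, "category": products[e["product_id"]]}


def aggregate_stage(events):
    """Stage 3: roll up revenue per category, standing in for an mdx style query."""
    totals = defaultdict(float)
    for e in events:
        totals[e["category"]] += e["price"]
    return dict(totals)


totals = aggregate_stage(join_stage(filter_stage(EVENTS), PRODUCTS))
print(totals)
```

Each stage only consumes the output of the previous one, which is why the stages can be swapped for different products without changing the overall design.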
>>>>>>
>>>>>> spark streaming has basic support, but it's not as mature and feature
>>>>>> rich as other stream processing products.
>>>>>>
>>>>>> On Wed, Dec 17, 2014 at 11:20 PM, Ajay <ajay.garga@gmail.com> wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> Can Cassandra be used for, or is it the best fit for, Real Time
>>>>>>> Analytics? I went through a couple of benchmarks of Cassandra vs
>>>>>>> HBase (most of them done 3 years ago), and they mentioned that
>>>>>>> Cassandra is designed for intensive writes and has higher read
>>>>>>> latency than HBase. In our case, we will have both writes and reads
>>>>>>> (but more reads, say 40% writes and 60% reads). We are planning to
>>>>>>> use Spark as the in-memory computation engine.
>>>>>>>
>>>>>>> Thanks
>>>>>>> Ajay
>>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>>
>>>>> [image: datastax_logo.png] <http://www.datastax.com/>
>>>>>
>>>>> Ryan Svihla
>>>>>
>>>>> Solution Architect
>>>>>
>>>>> [image: twitter.png] <https://twitter.com/foundev> [image:
>>>>> linkedin.png] <http://www.linkedin.com/pub/ryan-svihla/12/621/727/>
>>>>>
>>>>> DataStax is the fastest, most scalable distributed database
>>>>> technology, delivering Apache Cassandra to the world’s most innovative
>>>>> enterprises. Datastax is built to be agile, always-on, and predictably
>>>>> scalable to any size. With more than 500 customers in 45 countries, DataStax
>>>>> is the database technology and transactional backbone of choice for the
>>>>> worlds most innovative companies such as Netflix, Adobe, Intuit, and eBay.
>>>>>
>>>>>
>>>
