incubator-cassandra-user mailing list archives

From aaron morton <aa...@thelastpickle.com>
Subject Re: Is Cassandra oversized for this kind of use case?
Date Sun, 28 Apr 2013 20:15:50 GMT
Sounds like something C* would be good at. 

I would do some searching on time series data in Cassandra, such as http://www.datastax.com/dev/blog/advanced-time-series-with-cassandra
And definitely consider storing data at the smallest level of granularity.
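
As a rough illustration of the time series idea, here is a minimal in-memory sketch in Python. It is not a Cassandra API: `TimeSeriesStore`, `partition_key` and the per-machine-per-day bucketing are hypothetical names and assumptions made for this example. It only shows how raw status rows could be grouped into one partition per machine per day and kept ordered by timestamp, so that the smallest granularity is stored and rollups are computed later.

```python
from collections import defaultdict
from datetime import datetime


def partition_key(machine_id, ts):
    """Hypothetical partition key: one partition per machine per day."""
    return (machine_id, ts.strftime("%Y-%m-%d"))


class TimeSeriesStore:
    """In-memory stand-in for a time series table (illustration only)."""

    def __init__(self):
        # partition key -> {timestamp: status}
        self.partitions = defaultdict(dict)

    def write(self, machine_id, ts, status):
        # Store every raw event; do not pre-aggregate on write.
        self.partitions[partition_key(machine_id, ts)][ts] = status

    def read_day(self, machine_id, day):
        # Return all events of one machine for one day, ordered by time.
        rows = self.partitions.get((machine_id, day), {})
        return sorted(rows.items())


store = TimeSeriesStore()
store.write("machine-1", datetime(2013, 4, 26, 8, 5), "RUNNING")
store.write("machine-1", datetime(2013, 4, 26, 8, 0), "IDLE")
store.write("machine-1", datetime(2013, 4, 27, 8, 0), "RUNNING")
for ts, status in store.read_day("machine-1", "2013-04-26"):
    print(ts.isoformat(), status)
```

Bucketing by day keeps each partition bounded in size; whether a day, an hour or a month is the right bucket depends on the write rate and would need testing.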

On the analytics side there is good news and not so good news. First, the good news is that reads
do not block writes, as they do in a traditional RDBMS (without MVCC) running with a transaction
isolation level of Repeatable Read or higher.

The not so good news is that it's not as easy to support the wide range of analytical queries
that you are used to with SQL using the standard Thrift/CQL API. If you need very flexible
analysis I recommend looking into Hive / Pig with Hadoop. DataStax Enterprise is a commercial
product but free for development, and a great way to learn without having to worry about the
setup: http://www.datastax.com/

You may also be interested in http://www.pentaho.com/ or http://www.karmasphere.com/

Hope that helps. 

-----------------
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 27/04/2013, at 5:26 AM, "Hiller, Dean" <Dean.Hiller@nrel.gov> wrote:

> I would at least start with 3 cheap nodes with RF=3, and most likely start with CL=TWO on
> writes and reads while getting your feet wet.  Don't buy very expensive computers like a lot do
> when getting into the game for the first time… Every time I walk into a new gig, they seem to
> think they need to spend 6/10k per node.  I think this kind of scenario sounds fine for
> Cassandra.  When you say virtualize, I believe you mean "use VMs"… many use Amazon VMs,
> and there is stuff to configure specifically for this if you are on Amazon.
> 
> If you are on your own VMs, you do need to worry about two nodes ending up on the same
> hardware and stealing resources from each other, or about that hardware failing.  I.e., the idea in
> NoSQL is that you typically have 3 copies of all data, so if one node goes down you are still live
> with CL=TWO.
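
The RF=3 / CL=TWO reasoning above can be checked with a little arithmetic. The following is a minimal Python sketch, not driver code; the function names are made up for this example. It relies on the usual replication rule that a read is guaranteed to overlap the latest acknowledged write when write replicas plus read replicas exceed the replication factor.

```python
def nodes_that_may_fail(rf, cl):
    """Replicas that can be down while a request at this CL still succeeds."""
    return rf - cl


def reads_overlap_writes(rf, write_cl, read_cl):
    """True if every read replica set shares a node with every write replica set."""
    return write_cl + read_cl > rf


RF = 3
CL_TWO = 2

print(nodes_that_may_fail(RF, CL_TWO))           # one replica may be down
print(reads_overlap_writes(RF, CL_TWO, CL_TWO))  # 2 + 2 > 3, so reads see the write
```

With only 2 nodes and RF=2, the same overlap at CL=TWO would leave no room for a node to be down, which is one reason to start with 3.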
> 
> Also, plan on doing ~300GB per node typically depending on how it works out in testing.
> 
> Later,
> Dean
> 
> From: Marc Teufel <teufel.marc@googlemail.com>
> Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Date: Friday, April 26, 2013 10:59 AM
> To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Subject: Re: Is Cassandra oversized for this kind of use case?
> 
> Okay, one billion rows of data is a lot; compared to that I am far, far away. Does that mean I
> can stay with Oracle? Maybe.
> But you're right when you say it's not only about big data but also about your needs.
> 
> So storing the data is one part, doing analysis is the second. I do a lot
> of calculations and queries to generate management criteria about how production is going
> right now, how production went over the last week, month, years and so on. Saving in a 5
> minute rhythm is only a compromise to reduce the amount of data - maybe in the future the
> use case will change to storing the status of each machine as soon as it changes. This
> will of course increase the amount of data and the complexity of my queries again. And sure,
> I show "live" data today... 5 minute old live data... but if I tell the CEO that I am also
> able to work with real live data, I am sure this is what he wants to get .... ;-)
> 
> Can you recommend Cassandra for this kind of scenario, or is it oversized?
> 
> Does it make sense to start with 2 nodes?
> 
> Can I virtualize these two nodes?
> 
> 
> Thx a lot for your assistance.
> 
> Marc
> 
> 
> 
> 
> 2013/4/26 Hiller, Dean <Dean.Hiller@nrel.gov>
> Well, it depends more on what you will do with the data.  I know I was on a Sybase (RDBMS)
> system with 1 billion rows, but it was getting close to not being able to handle more (constraints
> had to be turned off, all sorts of optimizations done, expert consultants brought in,
> and everything).
> 
> BUT there are other use cases that NoSQL is great for (i.e. it is not just great for
> big data type systems).  It is great for really high write throughput, as you can add more
> nodes and handle more writes/second than an RDBMS very easily, even if you are doing so many
> deletes that the system constantly stays at a small data set.
> 
> You may want to analyze the data constantly or in near real time, involving huge numbers
> of reads/second, in which case NoSQL can be better as well.
> 
> I.e. NoSQL is not just for big data.  I know with PlayOrm for Cassandra, we have handled
> many different use cases out there.
> 
> Later,
> Dean
> 
> From: Marc Teufel <teufel.marc@googlemail.com>
> Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Date: Friday, April 26, 2013 8:17 AM
> To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Subject: Is Cassandra oversized for this kind of use case?
> 
> I hope the Cassandra community can help me reach a decision.
> 
> The project I am currently working on is located in an industrial plant. Machines are connected
> to a server, and every 5 minutes I get data from the machines about their status. We are talking
> about a production line with 100+ machines, so the amount of data is very high:
> 
> Per machine, one row every 5 minutes,
> means 12 rows per hour, means roundabout 120 rows per day per machine = 12.000+ rows per day
> for all machines; multiplied by 20 it's 240.000 rows per month and 2.880.000 rows per year. I have
> to hold the last 3 years and I must be able to do analytics on this data. In the end I deal with
> roundabout 10 million rows (12 columns holding text and numbers in each row).
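
The volume estimate above can be reproduced with a short Python sketch. The inputs are assumptions read from the message, not measured values: 120 rows per machine per day (12 rows per hour over roughly 10 production hours), 100 machines, and 20 production days per month (the "multiplied by 20" step). The function name `estimate_rows` is made up for this example.

```python
def estimate_rows(years, machines=100, rows_per_machine_per_day=120,
                  production_days_per_month=20):
    """Rough row-count estimate from the assumed plant parameters."""
    per_day = machines * rows_per_machine_per_day
    per_month = per_day * production_days_per_month
    per_year = per_month * 12
    return {
        "per_day": per_day,
        "per_month": per_month,
        "per_year": per_year,
        "total": per_year * years,
    }


estimate = estimate_rows(years=3)
for name, rows in estimate.items():
    print(f"{name}: {rows:,}")
```

Under these assumptions three years come to 8,640,000 rows, which matches the "roundabout 10 million" figure. Storing a row on every status change instead of every 5 minutes would change `rows_per_machine_per_day`, not the structure of the estimate.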
> Okay, this is not really "big data", is it? But for me it's a lot of data
> to handle anyway.
> Currently I am holding all this data in an Oracle database, but doing analytics on so many
> rows is not the good and modern way, I think. As the company is successful it will grow,
> which means more machines and again more data to handle...
> 
> So I thought maybe Big Data technologies are a possible solution for me to store my data.
> 
> Meanwhile I know Apache Hadoop is not the right tool for this kind of thing because it
> does not scale down. But maybe Cassandra? This is my question to you: do you think Cassandra is
> the right store for this kind of data?
> 
> I am thinking about 2 nodes. Maybe virtual.
> 
> Let me know what you think. And if Cassandra is not the right tool please tell me, and
> if you know any alternatives please tell me. Maybe I am already doing the right thing by
> storing that much data in an Oracle database, and maybe one of you is doing the same - if so please
> let me know as well.
> 
> Thank you very much.
> 
> 
> Web: http://www.teufel.net
> 
> 
> 
> --
> Mail: teufel.marc@gmail.com
> Web: http://www.teufel.net

