cassandra-user mailing list archives

From "Marcelo Valle (BLOOMBERG/ LONDON)" <mvallemil...@bloomberg.net>
Subject Re: How to speed up SELECT * query in Cassandra
Date Wed, 11 Feb 2015 11:42:21 GMT
> cassandra makes a very poor data warehouse or long term time series store

Really? This is not the impression I have... I think Cassandra is good for storing large amounts
of data and historical information; it's just not good for storing temporary data.
Netflix has a large amount of data and it's all stored in Cassandra, AFAIK.

> The very nature of cassandra's distributed nature vs partitioning data on hadoop makes
> spark on hdfs actually faster than on cassandra.

I am not sure about the current state of Spark support for Cassandra, but I guess if you create
a map reduce job, the intermediate map results will still be stored in HDFS, as happens
with Hadoop, is this right? I think the problem with Spark + Cassandra or with Hadoop + Cassandra
is that the hard part Spark or Hadoop does, the shuffling, could be done out of the box with
Cassandra, but no one takes advantage of that. What if a map / reduce job used a temporary
CF in Cassandra to store intermediate results?
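
To make the idea concrete, here is a minimal sketch in plain Python. It is not Spark or
Cassandra code: TempColumnFamily is a hypothetical in-memory stand-in for a temporary CF,
and the word-count job is only an illustration. The point is that writing each map output
under its key as the partition key already groups the values, which is the job the shuffle
normally does.

```python
from collections import defaultdict


class TempColumnFamily:
    """Hypothetical in-memory stand-in for a temporary Cassandra CF.

    Rows written with the same partition key end up in the same partition.
    """

    def __init__(self):
        self._partitions = defaultdict(list)

    def insert(self, partition_key, value):
        self._partitions[partition_key].append(value)

    def partitions(self):
        return self._partitions.items()


def map_phase(lines, temp_cf):
    # Emit (word, 1) pairs, using the word as the partition key.
    for line in lines:
        for word in line.split():
            temp_cf.insert(word.lower(), 1)


def reduce_phase(temp_cf):
    # Each partition already holds every value for one key, so the
    # reducer only has to walk the partitions and sum them.
    return {key: sum(values) for key, values in temp_cf.partitions()}


def word_count(lines):
    temp_cf = TempColumnFamily()
    map_phase(lines, temp_cf)
    return reduce_phase(temp_cf)
```

In a real cluster the grouping would cost writes and reads against Cassandra, so whether
this beats an ordinary shuffle is an open question, not something the sketch shows.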
From: user@cassandra.apache.org 
Subject: Re: How to speed up SELECT * query in Cassandra

I use spark with cassandra, and you don't need DSE.

I see a lot of people ask this same question below (how do I get a lot of data out of cassandra?),
and my question is always, why aren't you updating both places at once?

For example, we use hadoop and cassandra in conjunction with each other. We use a message
bus to store every event in both and aggregate in both, but only keep current data in cassandra
(cassandra makes a very poor data warehouse or long term time series store), and then use services
to process queries that merge data from hadoop and cassandra.
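
A rough sketch of that dual-write setup, in plain Python with in-memory stand-ins. All
class and function names here (EventBus, ArchiveStore, CurrentStore, merged_total) are
hypothetical, events are assumed to be (timestamp, key, value) tuples, and the fixed
cutoff is a simplification of "only keep current data".

```python
from collections import defaultdict


class EventBus:
    """Fans each published event out to every subscriber."""

    def __init__(self):
        self._handlers = []

    def subscribe(self, handler):
        self._handlers.append(handler)

    def publish(self, event):
        for handler in self._handlers:
            handler(event)


class ArchiveStore:
    """Stand-in for the hadoop side: keeps every event."""

    def __init__(self):
        self.events = []

    def handle(self, event):
        self.events.append(event)

    def total(self, key, before):
        # Sum the values for `key` with a timestamp earlier than `before`.
        return sum(v for ts, k, v in self.events if k == key and ts < before)


class CurrentStore:
    """Stand-in for the cassandra side: aggregates only events at or after `cutoff`."""

    def __init__(self, cutoff):
        self.cutoff = cutoff
        self.totals = defaultdict(int)

    def handle(self, event):
        ts, key, value = event
        if ts >= self.cutoff:
            self.totals[key] += value

    def total(self, key):
        return self.totals.get(key, 0)


def merged_total(key, archive, current):
    # Query service: history from the archive, recent data from the current store.
    return archive.total(key, before=current.cutoff) + current.total(key)
```

Usage would look something like: subscribe both stores' handle methods to one bus, publish
every event once, and answer queries through merged_total.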

Also, spark on hdfs gives more flexibility in terms of large datasets and performance.  The
very nature of cassandra's distributed nature vs partitioning data on hadoop makes spark on
hdfs actually faster than on cassandra.


--
Colin Clark 
+1 612 859 6129
Skype colin.p.clark

On Feb 11, 2015, at 4:49 AM, Jens Rantil <jens.rantil@tink.se> wrote:


On Wed, Feb 11, 2015 at 11:40 AM, Marcelo Valle (BLOOMBERG/ LONDON) <mvallemilita@bloomberg.net>
wrote:

If you use Cassandra enterprise, you can use Hive, AFAIK.

Even better, you can use Spark/Shark with DSE.

Cheers,
Jens


-- 
Jens Rantil
Backend engineer
Tink AB

Email: jens.rantil@tink.se
Phone: +46 708 84 18 32
Web: www.tink.se
