cassandra-user mailing list archives

From Colin <co...@clark.ws>
Subject Re: How to speed up SELECT * query in Cassandra
Date Wed, 11 Feb 2015 15:05:02 GMT
No, the question isn't closed.  You don't get to decide that.

I don't run a website making claims regarding cassandra and spark; your employer does.

Again, where are your benchmarks?

I will publish mine, then we'll see what you've got.

--
Colin Clark 
+1 612 859 6129
Skype colin.p.clark

> On Feb 11, 2015, at 8:39 AM, DuyHai Doan <doanduyhai@gmail.com> wrote:
> 
> For your information, Colin: http://en.wikipedia.org/wiki/List_of_fallacies. Look at "Burden of proof"
> 
> You stated "The very nature of cassandra's distributed nature vs partitioning data on hadoop makes spark on hdfs actually faster than on cassandra...."
> 
> It's up to YOU to prove it right, not up to me to prove it wrong.
> 
> All other bla bla is troll.
> 
> Come back to me once you get some decent benchmarks supporting your statement; until then, the question is closed.
> 
> 
> 
>> On Wed, Feb 11, 2015 at 3:17 PM, Colin <colin@clark.ws> wrote:
>> Did you want me to include specific examples from my employment at datastax or start from the ground up?
>> 
>> All spark on cassandra is, is a better option than the previous use of hive.
>> 
>> The fact that datastax hasn't provided any benchmarks themselves other than glossy marketing statements pretty much says it all. Where are your benchmarks?  Maybe you could combine it with the in-memory option to really boogie...
>> 
>> :)
>> 
>> (If I find time, I might just write a blog post about exactly how to do this. It involves the use of parquet and partitioning with clustering, it doesn't cost anything to do, and it's in production at my company.)
>> --
>> Colin Clark 
>> +1 612 859 6129
>> Skype colin.p.clark
>> 
>>> On Feb 11, 2015, at 6:51 AM, DuyHai Doan <doanduyhai@gmail.com> wrote:
>>> 
>>> "The very nature of cassandra's distributed nature vs partitioning data on hadoop makes spark on hdfs actually faster than on cassandra...."
>>> 
>>> Prove it. Did you ever look at the source code of the Spark/Cassandra connector to see how data locality is achieved before throwing out such a statement?
>>> 
>>>> On Wed, Feb 11, 2015 at 12:42 PM, Marcelo Valle (BLOOMBERG/ LONDON) <mvallemilita@bloomberg.net> wrote:
>>>> > cassandra makes a very poor datawarehouse or long term time series store
>>>> 
>>>> Really? This is not the impression I have... I think Cassandra is good for storing large amounts of data and historical information; it's only not good for storing temporary data.
>>>> Netflix has a large amount of data and it's all stored in Cassandra, AFAIK.
>>>> 
>>>> > The very nature of cassandra's distributed nature vs partitioning data on hadoop makes spark on hdfs actually faster than on cassandra.
>>>> 
>>>> I am not sure about the current state of Spark support for Cassandra, but I guess if you create a map reduce job, the intermediate map results will still be stored in HDFS, as happens with Hadoop. Is this right? I think the problem with Spark + Cassandra or with Hadoop + Cassandra is that the hard part Spark or Hadoop does, the shuffling, could be done out of the box with Cassandra, but no one takes advantage of that. What if a map / reduce job used a temporary CF in Cassandra to store intermediate results?
>>>> 
>>>> From: user@cassandra.apache.org 
>>>> Subject: Re: How to speed up SELECT * query in Cassandra
>>>> I use spark with cassandra, and you don't need DSE.
>>>> 
>>>> I see a lot of people ask this same question below (how do I get a lot of data out of cassandra?), and my question is always: why aren't you updating both places at once?
>>>> 
>>>> For example, we use hadoop and cassandra in conjunction with each other. We use a message bus to store every event in both and aggregate in both, but only keep current data in cassandra (cassandra makes a very poor datawarehouse or long term time series store), and then use services to process queries that merge data from hadoop and cassandra.
>>>> 
>>>> Also, spark on hdfs gives more flexibility in terms of large datasets and performance.  The very nature of cassandra's distributed nature vs partitioning data on hadoop makes spark on hdfs actually faster than on cassandra....
>>>> 
>>>> 
>>>> 
>>>> --
>>>> Colin Clark 
>>>> +1 612 859 6129
>>>> Skype colin.p.clark
>>>> 
>>>>> On Feb 11, 2015, at 4:49 AM, Jens Rantil <jens.rantil@tink.se> wrote:
>>>>> 
>>>>> 
>>>>>> On Wed, Feb 11, 2015 at 11:40 AM, Marcelo Valle (BLOOMBERG/ LONDON) <mvallemilita@bloomberg.net> wrote:
>>>>>> If you use Cassandra enterprise, you can use hive, AFAIK.
>>>>> 
>>>>> Even better, you can use Spark/Shark with DSE.
>>>>> 
>>>>> Cheers,
>>>>> Jens
>>>>> 
>>>>> 
>>>>> -- 
>>>>> Jens Rantil
>>>>> Backend engineer
>>>>> Tink AB
>>>>> 
>>>>> Email: jens.rantil@tink.se
>>>>> Phone: +46 708 84 18 32
>>>>> Web: www.tink.se
>>>>> 
> 
