cassandra-user mailing list archives

From Bhuvan Rawal <bhu1ra...@gmail.com>
Subject Re: An extremely fast cassandra table full scan utility
Date Mon, 03 Oct 2016 19:07:26 GMT
It would be interesting to have a comparison with Spark here for basic use
cases.

From a naive observation, it appears that this could be slower than Spark,
as a lot of data is streamed over the network.

On the other hand, with this approach we have seen that Young GC takes
nearly full CPU (possibly because a lot of data is moved on and off heap;
the Young Gen keeps emptying and filling, sometimes multiple times a
second). The same should happen with Spark, since it also calls the
Cassandra driver. On top of that, the Spark cluster shares the same compute
resources for filtering and other operations on the data. If we have an
appropriately sized client machine with enough network bandwidth, this
could potentially work faster, of course for basic scanning use cases.

Which of these assumptions seems to be more appropriate?

On Mon, Oct 3, 2016 at 11:40 PM, DuyHai Doan <doanduyhai@gmail.com> wrote:

> Hello Siddharth
>
> I just glanced over the architecture diagram. The idea of using
> multiple threads, one for each token range, is great. It helps max out
> parallelism.
>
> With https://issues.apache.org/jira/browse/CASSANDRA-11521 it would be
> even faster.
>
> On Mon, Oct 3, 2016 at 7:51 PM, siddharth verma <
> sidd.verma29.list@gmail.com> wrote:
>
>> Hi,
>> I was working on a utility which can be used for Cassandra full table
>> scans at very high speed: cassandra fast full table scan.
>> How fast?
>> The script dumped ~229 million rows in 116 seconds on a cluster of
>> 6 nodes.
>> Data transfer rates of up to 25 MBps were observed on the Cassandra nodes.
>>
>> One of our use cases required a Spark cluster, but for some reason we
>> couldn't create one. Hence, one may use this utility to iterate
>> through the entire table at very high speed.
>>
>> Now, for any full scan, I use it freely in my ad hoc Java programs to
>> manipulate or aggregate Cassandra data.
>>
>> You can customize the options (fetch size, consistency level, and
>> degree of parallelism, i.e. the number of threads) according to your need.
>>
>> You can visit https://github.com/siddv29/cfs to go through the code, see
>> the logic behind it, or try it in your program.
>> A sample program is also provided.
>>
>> I coded this utility in Java.
>>
>> Bhuvan Rawal (bhu1rawal@gmail.com) and I worked on this concept.
>> For Python you may visit his blog
>> (http://casualreflections.io/tech/cassandra/python/Multiprocess-Producer-Cassandra-Python)
>> and GitHub
>> (https://gist.github.com/bhuvanrawal/93c5ae6cdd020de47e0981d36d2c0785)
>>
>> Looking forward to your suggestions and comments.
>>
>> P.S. Give it a try. Trust me, the iteration speed is awesome!!
>> It is a bare application, built quickly. If you would like to contribute
>> to the Java utility, or add to or build on it, do reach out at
>> sidd.verma29.lists@gmail.com
>>
>> Thanks and Regards,
>> Siddharth Verma
>> (previous email id on this mailing list : verma.siddharth@snapdeal.com)
>>
>
>
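
The "one thread per token range" idea discussed in this thread can be
illustrated with a minimal Python sketch. It is not the code from the cfs
repository or from the linked gist. It assumes the cluster uses
Murmur3Partitioner (tokens are signed 64-bit integers), and the names
split_token_ring, build_queries, scan_all and run_query are hypothetical,
chosen only for this example.

```python
from concurrent.futures import ThreadPoolExecutor

# Assumption: Murmur3Partitioner, whose tokens are signed 64-bit integers.
MIN_TOKEN = -(2**63)
MAX_TOKEN = 2**63 - 1


def split_token_ring(num_splits):
    """Split the full token ring into num_splits contiguous (start, end) pairs."""
    if num_splits < 1:
        raise ValueError("num_splits must be at least 1")
    span = MAX_TOKEN - MIN_TOKEN
    # Integer arithmetic keeps the bounds exact for 64-bit token values.
    bounds = [MIN_TOKEN + (span * i) // num_splits for i in range(num_splits + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


def build_queries(keyspace, table, partition_key, num_splits):
    """Build one CQL token-range query per subrange (hypothetical helper)."""
    queries = []
    for i, (start, end) in enumerate(split_token_ring(num_splits)):
        # The first range includes its lower bound; later ranges exclude it,
        # so adjacent ranges never read the same token twice.
        lower_op = ">=" if i == 0 else ">"
        queries.append(
            f"SELECT * FROM {keyspace}.{table} "
            f"WHERE token({partition_key}) {lower_op} {start} "
            f"AND token({partition_key}) <= {end}"
        )
    return queries


def scan_all(queries, run_query, workers):
    """Run every range query on a thread pool; results come back in query order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_query, queries))
```

Here run_query stands in for whatever actually executes a query, for
example a function that calls the driver's session.execute with the chosen
fetch size and consistency level. Passing it in as a parameter keeps the
range-splitting logic independent of any particular driver.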
