cassandra-user mailing list archives

From siddharth verma <sidd.verma29.l...@gmail.com>
Subject An extremely fast cassandra table full scan utility
Date Mon, 03 Oct 2016 17:51:23 GMT
Hi,
I have been working on a utility that runs a full table scan on Cassandra
at very high speed.
How fast?
The script dumped ~229 million rows in 116 seconds (roughly 2 million rows
per second) on a 6-node cluster.
Data transfer rates of up to 25 MBps were observed on the Cassandra nodes.

For one use case a Spark cluster was required, but for some reason we
couldn't create one. In such a situation, one may use this utility to
iterate through the entire table at very high speed.

I now use it freely for any full scan in my ad hoc Java programs that
manipulate or aggregate Cassandra data.

You can customize the options (fetch size, consistency level, and degree of
parallelism, i.e. number of threads) according to your needs.
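
To illustrate what "degree of parallelism" can mean for a full scan, here is
a minimal sketch of one common approach: split the token ring into equal
slices and give each thread one slice to query. This is not taken from the
cfs code. The class TokenRangeSplitter, its methods, and the table and key
names are hypothetical, and the sketch assumes the cluster uses
Murmur3Partitioner (tokens are signed 64-bit values).

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of cfs: splits the Murmur3 token ring
// into contiguous slices that independent threads could scan.
public class TokenRangeSplitter {

    /** One slice (start, end] of the token ring. */
    public static final class Range {
        public final long start;
        public final long end;

        Range(long start, long end) {
            this.start = start;
            this.end = end;
        }
    }

    /** Splits [Long.MIN_VALUE, Long.MAX_VALUE] into `parts` contiguous slices. */
    public static List<Range> split(int parts) {
        if (parts < 1) {
            throw new IllegalArgumentException("parts must be >= 1");
        }
        // BigInteger avoids overflow: the ring is 2^64 - 1 wide.
        BigInteger min = BigInteger.valueOf(Long.MIN_VALUE);
        BigInteger width = BigInteger.valueOf(Long.MAX_VALUE).subtract(min);
        BigInteger n = BigInteger.valueOf(parts);

        List<Range> ranges = new ArrayList<>(parts);
        long start = Long.MIN_VALUE;
        for (int i = 1; i <= parts; i++) {
            // The last slice always ends exactly at Long.MAX_VALUE.
            long end = (i == parts)
                    ? Long.MAX_VALUE
                    : min.add(width.multiply(BigInteger.valueOf(i)).divide(n))
                         .longValueExact();
            ranges.add(new Range(start, end));
            start = end; // next slice begins where this one ended
        }
        return ranges;
    }

    /** Builds the CQL text for one slice; table and key names are supplied by the caller. */
    public static String query(String table, String partitionKey, Range r) {
        return "SELECT * FROM " + table
                + " WHERE token(" + partitionKey + ") > " + r.start
                + " AND token(" + partitionKey + ") <= " + r.end;
    }
}
```

Each query string could then be run on its own thread, with the fetch size
and consistency level set on the statement through whichever driver you use.
The lower bound of the first slice is exclusive, which should be safe
because, as far as I know, Murmur3Partitioner never assigns Long.MIN_VALUE
as a key's token.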

You can visit https://github.com/siddv29/cfs to go through the code, see
the logic behind it, or try it in your program.
A sample program is also provided.

I coded this utility in Java.

Bhuvan Rawal (bhu1rawal@gmail.com) and I worked on this concept.
For Python, you may visit his blog (
http://casualreflections.io/tech/cassandra/python/Multiprocess-Producer-Cassandra-Python)
and GitHub gist (
https://gist.github.com/bhuvanrawal/93c5ae6cdd020de47e0981d36d2c0785).

Looking forward to your suggestions and comments.

P.S. Give it a try. Trust me, the iteration speed is awesome!!
It is a bare-bones application, built quickly. If you would like to
contribute to the Java utility, or add to or build upon it, do reach out at
sidd.verma29.lists@gmail.com

Thanks and Regards,
Siddharth Verma
(previous email ID on this mailing list: verma.siddharth@snapdeal.com)
