cassandra-user mailing list archives

From Tobias Eriksson <tobias.eriks...@qvantel.com>
Subject How can I efficiently export the content of my table to KAFKA
Date Wed, 26 Apr 2017 19:49:33 GMT
Hi
I would like to make a dump of the database, in JSON format, to Kafka.
The database contains a lot of data: millions, and in some cases billions, of “rows”.
I will provide the customer with an export of the data, which they can read off of a Kafka topic.

My thinking is to make this scalable by distributing the token ranges of all available partition keys across a number of (N) processes (JSON-producers).
First, a single process will read through the available token ranges and publish them on a Kafka “Coordinator” topic.
Then I can create 1, 10, 20 or N processes that act as producers to the real Kafka topic, picking available token ranges / partition keys off of the “Coordinator” topic, one by one, until all the “rows” have been processed.
So a JSON-producer will take e.g. a range of 1000 “rows”, convert them into my own JSON format and post them to Kafka, then take another 1000 “rows”, and another, and so on until it is done.
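
To make the idea concrete, here is a rough sketch of the coordinator step, using the DataStax Java driver 3.x and the Kafka Java client. The contact point, the topic name “export-coordinator” and the split factor are just placeholders, not something from our actual setup. It reads the cluster’s token ranges from the driver metadata and publishes each sub-split range as one “job” message:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.TokenRange;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Properties;

public class CoordinatorSeeder {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");

        try (Cluster cluster = Cluster.builder()
                                      .addContactPoint("127.0.0.1").build();
             KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // One message per (sub-split) token range; the value is simply
            // "start:end" so any JSON-producer can parse it later.
            for (TokenRange range : cluster.getMetadata().getTokenRanges()) {
                for (TokenRange part : range.unwrap()) {           // handle the wrap-around range
                    for (TokenRange split : part.splitEvenly(8)) { // finer-grained jobs
                        String job = split.getStart() + ":" + split.getEnd();
                        producer.send(new ProducerRecord<>("export-coordinator", job));
                    }
                }
            }
            producer.flush();
        }
    }
}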

I base the idea on how I believe the Spark Cassandra Connector accomplishes data locality, i.e. being aware of where tokens reside. Since that is possible, I figured it should also be possible to create a job list in a Kafka topic, have each producer pick jobs from there, read the data from Cassandra based on the partition key (token range), and post the JSON on the export Kafka topic.
https://dzone.com/articles/data-locality-w-cassandra-how
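
And a rough sketch of one JSON-producer to go with it, assuming the default Murmur3Partitioner (so tokens are plain 64-bit longs), a single partition key column “id”, and placeholder keyspace/table/topic names. It picks range-jobs off the coordinator topic, pages through the matching rows with SELECT JSON, and republishes each row on the export topic:

import com.datastax.driver.core.BoundStatement;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Collections;
import java.util.Properties;

public class JsonProducer {
    public static void main(String[] args) {
        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "localhost:9092");
        consumerProps.put("group.id", "json-producers");   // all N producers share one group
        consumerProps.put("enable.auto.commit", "false");
        consumerProps.put("key.deserializer",
                          "org.apache.kafka.common.serialization.StringDeserializer");
        consumerProps.put("value.deserializer",
                          "org.apache.kafka.common.serialization.StringDeserializer");

        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "localhost:9092");
        producerProps.put("key.serializer",
                          "org.apache.kafka.common.serialization.StringSerializer");
        producerProps.put("value.serializer",
                          "org.apache.kafka.common.serialization.StringSerializer");

        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect();
             KafkaConsumer<String, String> jobs = new KafkaConsumer<>(consumerProps);
             KafkaProducer<String, String> out = new KafkaProducer<>(producerProps)) {

            PreparedStatement select = session.prepare(
                "SELECT JSON * FROM ks.events WHERE token(id) > ? AND token(id) <= ?");
            jobs.subscribe(Collections.singletonList("export-coordinator"));

            while (true) {
                ConsumerRecords<String, String> batch = jobs.poll(1000);
                for (ConsumerRecord<String, String> job : batch) {
                    String[] bounds = job.value().split(":");   // "start:end" from the seeder
                    BoundStatement bs = select.bind(Long.parseLong(bounds[0]),
                                                    Long.parseLong(bounds[1]));
                    bs.setFetchSize(1000);                      // ~1000 rows per page
                    for (Row row : session.execute(bs)) {
                        // SELECT JSON returns a single column named "[json]"
                        out.send(new ProducerRecord<>("export-json", row.getString("[json]")));
                    }
                }
                jobs.commitSync();   // mark the consumed range-jobs as done
            }
        }
    }
}

One thing I realise from sketching this: the “Coordinator” topic would need at least as many partitions as there are JSON-producers in the consumer group, otherwise some of the producers would just sit idle.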


Would you consider this a good idea?
Or is there in fact a better approach, and if so, what would that be?

-Tobias
