cassandra-user mailing list archives

From "Xu Zhongxing" <>
Subject Re: full-table scan - extracting all data from C*
Date Wed, 28 Jan 2015 01:33:32 GMT
Both Java driver "select * from table" and Spark sc.cassandraTable() work well. 
I use both of them frequently.

At 2015-01-28 04:06:20, "Mohammed Guller" <> wrote:

Hi –


Over the last few weeks, I have seen several emails on this mailing list from people trying
to extract all data from C*, so that they can import that data into other analytical tools
that provide much richer analytics functionality than C*. Extracting all data from C* is a
full-table scan, which is not the ideal use case for C*. However, people don’t have much
choice if they want to do ad-hoc analytics on the data in C*. Unfortunately, I don’t think
C* comes with any built-in tools that make this task easy for a large dataset. Please correct
me if I am wrong. Cqlsh has a COPY TO command, but it doesn’t really work if you have a
large amount of data in C*.


I am aware of a couple of approaches for extracting all data from a table in C*:

1) Iterate through all the C* partitions (physical rows) using the Java Driver and CQL.

2) Extract the data directly from the SSTable files.


Either approach can be used with Hadoop or Spark to speed up the extraction process.
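
To make approach #1 concrete, here is a rough sketch of how the scan is commonly parallelized: split the token ring into sub-ranges and issue one CQL query per sub-range, so that separate workers (Hadoop or Spark tasks, or plain threads) can each pull one slice. It assumes the cluster uses Murmur3Partitioner, whose tokens span -2^63 to 2^63 - 1. The function names and the keyspace, table, and key names are hypothetical, and the sketch only builds the query strings; running them through a driver is left out.

```python
# Sketch only: split the Murmur3 token ring into contiguous sub-ranges and
# build one CQL query per sub-range. Assumes Murmur3Partitioner; the
# keyspace, table, and partition key names passed in are hypothetical.

MIN_TOKEN = -(2**63)     # smallest Murmur3Partitioner token
MAX_TOKEN = 2**63 - 1    # largest Murmur3Partitioner token


def split_token_ring(num_splits):
    """Return (start, end) pairs that cover the whole ring without overlap.

    Each pair is read as start < token <= end, except the first pair,
    whose start is treated as inclusive by range_queries().
    """
    if num_splits < 1:
        raise ValueError("num_splits must be at least 1")
    total = MAX_TOKEN - MIN_TOKEN
    # Integer arithmetic keeps the bounds exact for 64-bit token values.
    bounds = [MIN_TOKEN + (total * i) // num_splits for i in range(num_splits + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


def range_queries(keyspace, table, partition_key, num_splits):
    """Build one SELECT per token sub-range; together they cover the table."""
    queries = []
    for i, (start, end) in enumerate(split_token_ring(num_splits)):
        # The first range includes its lower bound so no token is skipped.
        lower_op = ">=" if i == 0 else ">"
        queries.append(
            f"SELECT * FROM {keyspace}.{table} "
            f"WHERE token({partition_key}) {lower_op} {start} "
            f"AND token({partition_key}) <= {end}"
        )
    return queries


if __name__ == "__main__":
    for q in range_queries("my_keyspace", "my_table", "id", 4):
        print(q)
```

Each query can then be executed independently with driver-side paging turned on, so no single request has to hold the full result set. In practice people often align the splits with the token ranges each node owns rather than using equal slices, which keeps each query local to one replica set.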


I wanted to do a quick survey and find out how many people on this mailing list have successfully
used approach #1 or #2 for extracting large datasets (terabytes) from C*. Also, if you have
used some other techniques, it would be great if you could share your approach with the group.


