cassandra-user mailing list archives

From "Brent N. Chun" <>
Subject Reading all rows in a column family in parallel
Date Thu, 08 Jul 2010 07:21:44 GMT

I'm running Cassandra 0.6.0 on a cluster and have an application that 
needs to read all rows from a column family using the Cassandra Thrift 
API. Ideally, I'd like to be able to do this by having all nodes in the 
cluster read in parallel (i.e., each node reads a disjoint set of rows 
that are stored locally). I should also mention that I'm using the 

Here's what I was thinking:

   1. Have one node invoke describe_ring to find the token range on the 
ring that each node is responsible for.

   2. For each token range, have the node that owns that portion of the 
ring read the rows in that range using a sequence of get_range_slices 
calls (using start/end tokens, not keys).
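A minimal sketch of the paging loop in step 2, written against a stand-in interface so it runs without a cluster. RangeReader, TokenRangeScan, and tokenOf are hypothetical names for illustration only. RangeReader plays the role of a get_range_slices call with start/end tokens, and the sketch assumes a token range excludes its start token and includes its end token, so resuming from the token of the last row returned does not repeat that row. Ranges that wrap around the ring are not treated specially here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Hypothetical stand-in for the Thrift client; not part of Cassandra. */
interface RangeReader {
    /** Returns up to count row keys with tokens in (startToken, endToken], in token order. */
    List<String> readRange(String startToken, String endToken, int count);
}

final class TokenRangeScan {
    /** Pages through one token range and returns every row key seen. */
    static List<String> scan(RangeReader reader,
                             Function<String, String> tokenOf,
                             String startToken,
                             String endToken,
                             int batchSize) {
        List<String> keys = new ArrayList<>();
        String cursor = startToken;
        while (true) {
            List<String> batch = reader.readRange(cursor, endToken, batchSize);
            if (batch.isEmpty()) {
                break; // nothing new: the end of the range has been reached
            }
            keys.addAll(batch);
            // Resume just after the last row returned by this batch.
            cursor = tokenOf.apply(batch.get(batch.size() - 1));
        }
        return keys;
    }
}
```

Each node would run scan once per token range it owns, using the start and end tokens reported by describe_ring for that range.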

This type of functionality seems to already be there in the tree with 
the recent Cassandra/Hadoop integration.

     KeyRange keyRange = new KeyRange(batchRowCount)
                             .setStart_token(startToken)
                             .setEnd_token(split.getEndToken());
     rows = client.get_range_slices(new ColumnParent(cfName),
                                    predicate,
                                    keyRange,
                                    ConsistencyLevel.ONE);

     // prepare for the next slice to be read
     KeySlice lastRow = rows.get(rows.size() - 1);
     IPartitioner p = DatabaseDescriptor.getPartitioner();
     byte[] rowkey = lastRow.getKey();
     startToken = p.getTokenFactory().toString(p.getToken(rowkey));

The above snippet from the Hadoop integration code seems to suggest it 
is possible to scan an entire column family by reading disjoint sets of 
rows using token-based range queries (as opposed to key-based range 
queries). Is this possible in 0.6.0? (Note: for the next startToken, I 
was just planning on computing the MD5 digest of the last key directly 
since I'm accessing Cassandra through Thrift.)
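A minimal sketch of computing the next start token from the last key on the client side, assuming an MD5-based partitioner. Md5Token and tokenFor are hypothetical names. The sketch also assumes that the token is the absolute value of the digest read as a signed big-endian integer, rendered as a decimal string, and that keys are encoded as UTF-8. Both assumptions should be checked against the partitioner source for the version in use, since a mismatch would silently skip or repeat rows.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class Md5Token {
    /** Decimal token string for a row key, derived from its MD5 digest. */
    static String tokenFor(String rowKey) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            // Assumption: keys are hashed as UTF-8 bytes.
            byte[] digest = md5.digest(rowKey.getBytes(StandardCharsets.UTF_8));
            // Assumption: the token is the absolute value of the signed digest.
            return new BigInteger(digest).abs().toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException("MD5 not available", e);
        }
    }
}
```

The returned string would be passed as the start token of the next get_range_slices call for the same range.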
