cassandra-user mailing list archives

From Max C <mc_cassan...@core43.com>
Subject Re: 1, 2, 3...
Date Sat, 09 Apr 2016 22:19:18 GMT
Looks like this guy (Brian Hess) wrote a script to split the token range and run count(*) on
each subrange:

https://github.com/brianmhess/cassandra-count
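
For reference, here is a rough sketch of the split-and-count idea in Python. It is not taken from the tool linked above; it only builds the per-subrange CQL statements and assumes the Murmur3Partitioner token bounds (-2^63 to 2^63 - 1). The keyspace, table, and key names (ks, t, id) are made up for illustration, and actually running the statements (e.g. through a driver, in parallel) and summing the counts is left out.

```python
# Token bounds, assuming Murmur3Partitioner (an assumption; other
# partitioners use a different token space).
MIN_TOKEN = -2**63
MAX_TOKEN = 2**63 - 1


def split_token_range(num_splits):
    """Return num_splits contiguous (start, end) pairs covering the full ring."""
    total = MAX_TOKEN - MIN_TOKEN
    bounds = [MIN_TOKEN + (total * i) // num_splits for i in range(num_splits + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


def count_queries(keyspace, table, partition_key, num_splits):
    """Build one SELECT COUNT(*) statement per token subrange.

    keyspace, table and partition_key are hypothetical names supplied by the
    caller; for a composite partition key pass e.g. "a, b".
    """
    queries = []
    for i, (start, end) in enumerate(split_token_range(num_splits)):
        # Subranges are (start, end]; the first one also includes its lower
        # bound so that no token value is skipped.
        lower_op = ">=" if i == 0 else ">"
        queries.append(
            f"SELECT COUNT(*) FROM {keyspace}.{table} "
            f"WHERE token({partition_key}) {lower_op} {start} "
            f"AND token({partition_key}) <= {end}"
        )
    return queries


if __name__ == "__main__":
    for q in count_queries("ks", "t", "id", 4):
        print(q)
```

Each statement only touches a slice of the ring, so a single slow or oversized subrange can be retried or split further without redoing the whole count.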

- Max

> On Apr 8, 2016, at 10:56 pm, Jeff Jirsa <jeff.jirsa@crowdstrike.com> wrote:
> 
> SELECT COUNT(*) probably works (with internal paging) on many datasets, given enough time
> and assuming you don’t have any partitions that will kill you.
> 
> No, it doesn’t count extra replicas / duplicates.
> 
> The old way to do this (before paging / fetch size) was to use manual paging based on
> tokens/clustering keys:
> 
> https://docs.datastax.com/en/cql/3.1/cql/cql_using/paging_c.html
> 
> SELECT’s WHERE clause can use token(), which is what you’d want to use to page through
> the whole token space.
> 
> You could, in theory, issue thousands of queries in parallel, all for different token
> ranges, and then sum the results. That’s what something like Spark would be doing. If you
> want to determine rows per node, limit the token range to that owned by the node (easier with
> 1 token than with vnodes; with vnodes, repeat num_tokens times).

