cassandra-user mailing list archives

From Jeff Jirsa <jeff.ji...@crowdstrike.com>
Subject Re: 1, 2, 3...
Date Sat, 09 Apr 2016 05:56:35 GMT
SELECT COUNT(*) probably works (with internal paging) on many datasets, given enough time and
assuming you don’t have any partitions that will kill you.

No, it doesn’t count extra replicas / duplicates.

The old way to do this (before paging / fetch size) was to use manual paging based on tokens/clustering
keys:

https://docs.datastax.com/en/cql/3.1/cql/cql_using/paging_c.html

A SELECT’s WHERE clause can use token(), which is what you’d want to use to page through the
whole token space.


You could, in theory, issue thousands of queries in parallel, all for different token ranges,
and then sum the results. That’s what something like Spark would be doing. If you want to
determine rows per node, limit the token range to the one owned by that node (easier with 1 token
per node than with vnodes; with vnodes, repeat for each of the node’s num_tokens ranges).
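
As a rough illustration of the "split the token space and sum" idea, here is a minimal Python
sketch. It is not taken from any driver or tool. It assumes the default Murmur3Partitioner (tokens
from -2^63 to 2^63-1) and a single-column partition key. The names split_token_ranges, count_query,
count_rows and run_count are hypothetical: run_count stands for whatever function you supply to
execute one CQL string and return the count it produced.

```python
from concurrent.futures import ThreadPoolExecutor

# Token bounds, assuming the Murmur3Partitioner (an assumption of this sketch).
MIN_TOKEN = -(2**63)
MAX_TOKEN = 2**63 - 1


def split_token_ranges(n):
    """Split the full token space into n contiguous (start, end] ranges."""
    span = MAX_TOKEN - MIN_TOKEN
    bounds = [MIN_TOKEN + (span * i) // n for i in range(n)] + [MAX_TOKEN]
    return list(zip(bounds[:-1], bounds[1:]))


def count_query(table, pk, start, end):
    """Build a COUNT(*) query restricted to one token range.

    The very first range uses >= so that the lowest token is not skipped;
    every other range uses > so that adjacent ranges never overlap.
    """
    lower = ">=" if start == MIN_TOKEN else ">"
    return (
        f"SELECT COUNT(*) FROM {table} "
        f"WHERE token({pk}) {lower} {start} AND token({pk}) <= {end}"
    )


def count_rows(run_count, table, pk, n_ranges=1000, workers=16):
    """Run one count per token range in parallel and sum the results.

    run_count is a caller-supplied (hypothetical) callable: it takes a CQL
    string and returns the integer count for that query.
    """
    queries = [count_query(table, pk, s, e)
               for s, e in split_token_ranges(n_ranges)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(run_count, queries))
```

With the DataStax Python driver, run_count could be something along the lines of
lambda q: session.execute(q).one()[0], though that is untested here. For a per-node count, the
same approach should apply if split_token_ranges is replaced by the list of token ranges for
which that node is the primary owner.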



From:  Jack Krupansky
Reply-To:  "user@cassandra.apache.org"
Date:  Friday, April 8, 2016 at 3:48 PM
To:  "user@cassandra.apache.org"
Subject:  1, 2, 3...

I'm afraid I don't have the solid answer to this obvious question: How do I get a fairly accurate
count of (CQL) rows in a Cassandra table? 

Does SELECT COUNT (*) FROM <table-name> actually do it?

Does it really count (CQL) rows across all nodes and exclude replicated rows?

Is there a better/preferred technique? For example, is it more efficient to query the row
count one node at a time?

And for bonus points: How do you count (CQL) rows for each node? Again, excluding replication.

-- Jack Krupansky

