cassandra-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jack Krupansky <jack.krupan...@gmail.com>
Subject Re: 1, 2, 3...
Date Mon, 11 Apr 2016 15:51:56 GMT
Unless I'm mistaken, nodetool tablestats gives you the number of partitions
(partition keys), not the number of primary keys. IOW, the term "keys" is
ambiguous. That's why I phrased the original question as count of (CQL)
rows, to distinguish from the pre-CQL3 concept of a partition being treated
as a single row.

-- Jack Krupansky

On Mon, Apr 11, 2016 at 11:46 AM, Emīls Šolmanis <emils.solmanis@gmail.com>
wrote:

> Wouldn't the "number of keys" part of *nodetool cfstats* run on every
> node, summed and divided by replication factor give you a decent
> approximation? Or are you really after a completely precise number?
>
> On Mon, 11 Apr 2016 at 16:18 Jack Krupansky <jack.krupansky@gmail.com>
> wrote:
>
>> Agreed, that anything requiring a full table scan, short of batch
>> analytics,is an antipattern, although the goal is not to do a full scan per
>> se, but just get the row count. It still surprises people that Cassandra
>> cannot quickly get COUNT(*). The easy answer: Use DSE Search and do a Solr
>> query for q=*:* and that will very quickly return the total row count. I
>> presume that Stratio will handle this fine as well.
>>
>>
>> -- Jack Krupansky
>>
>> On Mon, Apr 11, 2016 at 11:10 AM, <SEAN_R_DURITY@homedepot.com> wrote:
>>
>>> Cassandra is not good for table scan type queries (which count(*)
>>> typically is). While there are some attempts to do that (as noted below),
>>> this is a path I avoid.
>>>
>>>
>>>
>>>
>>>
>>> Sean Durity
>>>
>>>
>>>
>>> *From:* Max C [mailto:mc_cassandra@core43.com]
>>> *Sent:* Saturday, April 09, 2016 6:19 PM
>>> *To:* user@cassandra.apache.org
>>> *Subject:* Re: 1, 2, 3...
>>>
>>>
>>>
>>> Looks like this guy (Brian Hess) wrote a script to split the token range
>>> and run count(*) on each subrange:
>>>
>>>
>>>
>>> https://github.com/brianmhess/cassandra-count
>>>
>>>
>>>
>>> - Max
>>>
>>>
>>>
>>> On Apr 8, 2016, at 10:56 pm, Jeff Jirsa <jeff.jirsa@crowdstrike.com>
>>> wrote:
>>>
>>>
>>>
>>> SELECT COUNT(*) probably works (with internal paging) on many datasets
>>> with enough time and assuming you don’t have any partitions that will kill
>>> you.
>>>
>>>
>>>
>>> No, it doesn’t count extra replicas / duplicates.
>>>
>>>
>>>
>>> The old way to do this (before paging / fetch size) was to use manual
>>> paging based on tokens/clustering keys:
>>>
>>>
>>>
>>> https://docs.datastax.com/en/cql/3.1/cql/cql_using/paging_c.html –
>>> SELECT’s WHERE clause can use token(), which is what you’d want to use to
>>> page through the whole token space.
>>>
>>>
>>>
>>> You could, in theory, issue thousands of queries in parallel, all for
>>> different token ranges, and then sum the results. That’s what something
>>> like spark would be doing. If you want to determine rows per node, limit
>>> the token range to that owned by the node (easier with 1 token than vnodes,
>>> with vnodes repeat num_tokens times).
>>>
>>>
>>>
>>> ------------------------------
>>>
>>> The information in this Internet Email is confidential and may be
>>> legally privileged. It is intended solely for the addressee. Access to this
>>> Email by anyone else is unauthorized. If you are not the intended
>>> recipient, any disclosure, copying, distribution or any action taken or
>>> omitted to be taken in reliance on it, is prohibited and may be unlawful.
>>> When addressed to our clients any opinions or advice contained in this
>>> Email are subject to the terms and conditions expressed in any applicable
>>> governing The Home Depot terms of business or client engagement letter. The
>>> Home Depot disclaims all responsibility and liability for the accuracy and
>>> content of this attachment and for any damages or losses arising from any
>>> inaccuracies, errors, viruses, e.g., worms, trojan horses, etc., or other
>>> items of a destructive nature, which may be contained in this attachment
>>> and shall not be liable for direct, indirect, consequential or special
>>> damages in connection with this e-mail message or its attachment.
>>>
>>
>>

Mime
View raw message