Return-Path: X-Original-To: apmail-cassandra-user-archive@www.apache.org Delivered-To: apmail-cassandra-user-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 5693919461 for ; Mon, 11 Apr 2016 15:46:39 +0000 (UTC) Received: (qmail 137 invoked by uid 500); 11 Apr 2016 15:46:33 -0000 Delivered-To: apmail-cassandra-user-archive@cassandra.apache.org Received: (qmail 99986 invoked by uid 500); 11 Apr 2016 15:46:32 -0000 Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@cassandra.apache.org Delivered-To: mailing list user@cassandra.apache.org Received: (qmail 99968 invoked by uid 99); 11 Apr 2016 15:46:32 -0000 Received: from pnap-us-west-generic-nat.apache.org (HELO spamd1-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 11 Apr 2016 15:46:32 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd1-us-west.apache.org (ASF Mail Server at spamd1-us-west.apache.org) with ESMTP id 7D8E0C1357 for ; Mon, 11 Apr 2016 15:46:32 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd1-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: 1.179 X-Spam-Level: * X-Spam-Status: No, score=1.179 tagged_above=-999 required=6.31 tests=[DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, HTML_MESSAGE=2, RCVD_IN_DNSWL_LOW=-0.7, RCVD_IN_MSPIKE_H3=-0.01, RCVD_IN_MSPIKE_WL=-0.01, SPF_PASS=-0.001] autolearn=disabled Authentication-Results: spamd1-us-west.apache.org (amavisd-new); dkim=pass (2048-bit key) header.d=gmail.com Received: from mx2-lw-us.apache.org ([10.40.0.8]) by localhost (spamd1-us-west.apache.org [10.40.0.7]) (amavisd-new, port 10024) with ESMTP id R9xZqjhFyM_O for ; Mon, 11 Apr 2016 15:46:30 +0000 (UTC) Received: from mail-wm0-f48.google.com (mail-wm0-f48.google.com [74.125.82.48]) by mx2-lw-us.apache.org (ASF Mail Server at mx2-lw-us.apache.org) with ESMTPS id F0CD65F20E for ; Mon, 11 Apr 2016 15:46:29 +0000 (UTC) Received: by mail-wm0-f48.google.com with SMTP id a140so18543406wma.0 for ; Mon, 11 Apr 2016 08:46:29 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:references:in-reply-to:from:date:message-id:subject:to; bh=0cRW6IQ6FI56wfdQ2DSc7vCcHDM/61FP4UX9qkB7V3M=; b=Fp1XtgRiM125j2jOzoKr7bAf9Tc8ZYu1G+ShSpU3AXgP0kxPRkpzEHKeb1wEa2t64q MjRTBw/ES0zYyDlwVLdaqAIHoOA4cx7qKDpXwneg3HITwYS0r6VhgDHC5rn68FrQbSHx K0XE2U9niFjksIst1luZWXNz1UYFB5O5yXf6xPGprPn1g18/5zHpIJCe+dGixDPNO6b7 3adxmet+fvDFAQcreKHsGFXYwJCkU71v0fAHVk1tMhxKrKwPuH/C4T52BrJLwssp38PH hY1bBJbne0ZF7quSoS70l4QwvvSIqE8p3oXzj92YPGGfVrI3bHYqHls/XzOg1RI35Dq3 egvw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to; bh=0cRW6IQ6FI56wfdQ2DSc7vCcHDM/61FP4UX9qkB7V3M=; b=iddKjjEt40KIDNRUTKPd1AB2LpWiVfUFWNoJRH0Sp89/N4CHsQOLeaQH7awUfn++lm e94XApahTRqfvCH+3tHxNxH8MGSQ2Z4GuIeW60UgrBRBC4Yfo5il3Yj2y1BJjt0dJM8f scC3fXaahlgnvxXwZ5Pg0dSA8Ib8wTgdY0PGvlaOswsHcg2RFjdkbD6MgfG7ChQP1d+V NCyAosTSdbzLqab9GNU7cjdnnVrg55kCEG3VpdsJHdFYP8JE3auCjOlxv74vTiZMA+b+ 1tsZS4DFA9D6jTU0Ni9743GyT2F9xp8EdN/pBgjDEy5VXp0tTVnXSz9o4KNLufoHd+oS IyGA== X-Gm-Message-State: AD7BkJKEp1R496hCRwYXELvxck7PReX/XG1w1pVntQP2WVeAatyxBMaVCB+ZmOMee2go29WXxZ9JKMX16oMulQ== X-Received: by 10.194.171.66 with SMTP id as2mr24597109wjc.110.1460389588808; Mon, 11 Apr 2016 08:46:28 -0700 (PDT) MIME-Version: 1.0 References: <7F285C73-A447-4D47-93D9-4294B25D7470@core43.com> <156AA069D74CB042AA529A6900665253248E84D6@CO1PRD6102MB001.025d.mgd.msft.net> In-Reply-To: From: =?UTF-8?B?RW3Eq2xzIMWgb2xtYW5pcw==?= Date: Mon, 11 Apr 2016 15:46:18 +0000 Message-ID: Subject: Re: 1, 2, 3... To: user@cassandra.apache.org Content-Type: multipart/alternative; boundary=089e0122f054acb6e00530377287 --089e0122f054acb6e00530377287 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable Wouldn't the "number of keys" part of *nodetool cfstats* run on every node, summed and divided by replication factor give you a decent approximation? Or are you really after a completely precise number? On Mon, 11 Apr 2016 at 16:18 Jack Krupansky wrote: > Agreed, that anything requiring a full table scan, short of batch > analytics,is an antipattern, although the goal is not to do a full scan p= er > se, but just get the row count. It still surprises people that Cassandra > cannot quickly get COUNT(*). The easy answer: Use DSE Search and do a Sol= r > query for q=3D*:* and that will very quickly return the total row count. = I > presume that Stratio will handle this fine as well. > > > -- Jack Krupansky > > On Mon, Apr 11, 2016 at 11:10 AM, wrote: > >> Cassandra is not good for table scan type queries (which count(*) >> typically is). While there are some attempts to do that (as noted below)= , >> this is a path I avoid. >> >> >> >> >> >> Sean Durity >> >> >> >> *From:* Max C [mailto:mc_cassandra@core43.com] >> *Sent:* Saturday, April 09, 2016 6:19 PM >> *To:* user@cassandra.apache.org >> *Subject:* Re: 1, 2, 3... >> >> >> >> Looks like this guy (Brian Hess) wrote a script to split the token range >> and run count(*) on each subrange: >> >> >> >> https://github.com/brianmhess/cassandra-count >> >> >> >> - Max >> >> >> >> On Apr 8, 2016, at 10:56 pm, Jeff Jirsa >> wrote: >> >> >> >> SELECT COUNT(*) probably works (with internal paging) on many datasets >> with enough time and assuming you don=E2=80=99t have any partitions that= will kill >> you. >> >> >> >> No, it doesn=E2=80=99t count extra replicas / duplicates. >> >> >> >> The old way to do this (before paging / fetch size) was to use manual >> paging based on tokens/clustering keys: >> >> >> >> https://docs.datastax.com/en/cql/3.1/cql/cql_using/paging_c.html =E2=80= =93 >> SELECT=E2=80=99s WHERE clause can use token(), which is what you=E2=80= =99d want to use to >> page through the whole token space. >> >> >> >> You could, in theory, issue thousands of queries in parallel, all for >> different token ranges, and then sum the results. That=E2=80=99s what so= mething >> like spark would be doing. If you want to determine rows per node, limit >> the token range to that owned by the node (easier with 1 token than vnod= es, >> with vnodes repeat num_tokens times). >> >> >> >> ------------------------------ >> >> The information in this Internet Email is confidential and may be legall= y >> privileged. It is intended solely for the addressee. Access to this Emai= l >> by anyone else is unauthorized. If you are not the intended recipient, a= ny >> disclosure, copying, distribution or any action taken or omitted to be >> taken in reliance on it, is prohibited and may be unlawful. When address= ed >> to our clients any opinions or advice contained in this Email are subjec= t >> to the terms and conditions expressed in any applicable governing The Ho= me >> Depot terms of business or client engagement letter. The Home Depot >> disclaims all responsibility and liability for the accuracy and content = of >> this attachment and for any damages or losses arising from any >> inaccuracies, errors, viruses, e.g., worms, trojan horses, etc., or othe= r >> items of a destructive nature, which may be contained in this attachment >> and shall not be liable for direct, indirect, consequential or special >> damages in connection with this e-mail message or its attachment. >> > > --089e0122f054acb6e00530377287 Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: quoted-printable
Wouldn't the "number of keys" part of=C2=A0<= i>nodetool cfstats=C2=A0run on every node, summed and divided by replic= ation factor give you a decent approximation? Or are you really after a com= pletely precise number?

On Mon, 11 Apr 2016 at 16:18 Jack Krupansky <jack.krupansky@gmail.com> wrote:
Agreed, that anything requiring a = full table scan, short of batch analytics,is an antipattern, although the g= oal is not to do a full scan per se, but just get the row count. It still s= urprises people that Cassandra cannot quickly get COUNT(*). The easy answer= : Use DSE Search and do a Solr query for q=3D*:* and that will very quickly= return the total row count. I presume that Stratio will handle this fine a= s well.


-- Jack Krupansky=

On Mon, Apr 11, 2016 at 11:10 AM, <SEAN_R_DURITY@homedepot.com> wrote:

Cassandra is not good for= table scan type queries (which count(*) typically is). While there are som= e attempts to do that (as noted below), this is a path I avoid.

=C2=A0

=C2=A0

Sean Durity=

=C2=A0

From: Max C = [mailto:mc_cas= sandra@core43.com]
Sent: Saturday, April 09, 2016 6:19 PM
To: u= ser@cassandra.apache.org
Subject: Re: 1, 2, 3...

=C2=A0

Looks like this guy (Brian Hess) wrote a script to s= plit the token range and run count(*) on each subrange:

=C2=A0

https://github.com/brianmhess/cassandra-count

=C2=A0

- Max

=C2=A0

On Apr 8, 2016, at 10:56 pm, Jeff Jirsa <jeff.jirsa@crowdst= rike.com> wrote:

=C2=A0

SELECT COUNT(*) probably works (with in= ternal paging) on many datasets with enough time and assuming you don=E2=80= =99t have any partitions that will kill you.

=C2=A0

No, it doesn=E2=80=99t count extra repl= icas / duplicates.

=C2=A0

The old way to do this (before paging /= fetch size) was to use manual paging based on tokens/clustering keys:

=C2=A0

https://docs.datast= ax.com/en/cql/3.1/cql/cql_using/paging_c.html=C2=A0=E2=80=93 SELECT=E2= =80=99s WHERE clause can use token(), which is what you=E2=80=99d want to use to page through the w= hole token space.=C2=A0

=C2=A0

You could, in theory, issue thousands o= f queries in parallel, all for different token ranges, and then sum the res= ults. That=E2=80=99s what something like spark would be doing. If you want to determine rows per node, limit the token range to that owned b= y the node (easier with 1 token than vnodes, with vnodes repeat num_tokens = times).

=C2=A0




The information in this Internet Email is confidential and may be legally p= rivileged. It is intended solely for the addressee. Access to this Email by= anyone else is unauthorized. If you are not the intended recipient, any di= sclosure, copying, distribution or any action taken or omitted to be taken in reliance on it, is prohibite= d and may be unlawful. When addressed to our clients any opinions or advice= contained in this Email are subject to the terms and conditions expressed = in any applicable governing The Home Depot terms of business or client engagement letter. The Home Depot d= isclaims all responsibility and liability for the accuracy and content of t= his attachment and for any damages or losses arising from any inaccuracies,= errors, viruses, e.g., worms, trojan horses, etc., or other items of a destructive nature, which may be contain= ed in this attachment and shall not be liable for direct, indirect, consequ= ential or special damages in connection with this e-mail message or its att= achment.

--089e0122f054acb6e00530377287--