Date: Fri, 17 Apr 2015 21:45:59 +0000 (UTC)
From: "Chris Lohfink (JIRA)"
To: commits@cassandra.apache.org
Reply-To: dev@cassandra.apache.org
Subject: [jira] [Commented] (CASSANDRA-9107) More accurate row count estimates

    [ https://issues.apache.org/jira/browse/CASSANDRA-9107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14500746#comment-14500746 ]

Chris Lohfink commented on CASSANDRA-9107:
------------------------------------------

I like having the memtable (MT) count included: when people run some simple, small tests, their data will show up in the estimate. I think it can confuse people if they insert some data and the value doesn't go up.

> More accurate row count estimates
> ---------------------------------
>
>                 Key: CASSANDRA-9107
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9107
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Chris Lohfink
>            Assignee: Chris Lohfink
>         Attachments: 9107-cassandra2-1.patch
>
>
> Currently the estimated row count from cfstats is the sum of the number of rows in all the sstables. This becomes very inaccurate with wide rows or heavily updated datasets, since the same partition can exist in many sstables. For example:
> {code}
> CREATE KEYSPACE test WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
> CREATE TABLE wide (key text PRIMARY KEY, value text)
>     WITH compaction = {'class': 'SizeTieredCompactionStrategy', 'min_threshold': 30, 'max_threshold': 100};
> -------------------------------
> INSERT INTO wide (key, value) VALUES ('key', 'value');
> // flush
> // cfstats output: Number of keys (estimate): 1 (128 in older version from index)
> INSERT INTO wide (key, value) VALUES ('key', 'value');
> // flush
> // cfstats output: Number of keys (estimate): 2 (256 in older version from index)
> ... etc
> {code}
> Previously it used the index, but it still computed the estimate per sstable and summed the results, which became inaccurate in the same way as the number of sstables grew (just much worse). With new versions of sstables we can merge the cardinalities to resolve this, with a slight hit to accuracy in the case of every sstable having completely unique partitions.
> Furthermore, I think it would be pretty minimal effort to include the number of rows in the memtables in this count. We won't have the cardinality merging between memtables and sstables, but I would consider that a relatively minor negative.
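
To illustrate the difference between summing per-sstable counts and merging cardinalities, here is a minimal Python sketch. It is not Cassandra's code: a simple K-minimum-values sketch stands in for whatever mergeable cardinality estimator the sstables store, and every name in it (K, make_sketch, merge, estimate) is hypothetical. It replays the ticket's example of the same partition 'key' flushed into several sstables, then adds a memtable count without merging, as the description proposes.

```python
import hashlib

K = 64  # sketch size (hypothetical; a larger K lowers the estimation error)


def _hash01(key: str) -> float:
    """Map a key to a pseudo-uniform float in [0, 1)."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def make_sketch(keys):
    """Build a K-minimum-values sketch: the K smallest distinct hash values."""
    return sorted({_hash01(k) for k in keys})[:K]


def merge(a, b):
    """Merge two sketches; equal keys hash equally, so duplicates collapse."""
    return sorted(set(a) | set(b))[:K]


def estimate(sketch):
    """Estimate distinct keys; exact while the sketch holds fewer than K values."""
    if len(sketch) < K:
        return len(sketch)
    return round((K - 1) / sketch[-1])


# Three flushes of the same partition, as in the ticket's example.
sstables = [["key"], ["key"], ["key"]]
sketches = [make_sketch(s) for s in sstables]

# Old behaviour: sum the per-sstable estimates, counting 'key' three times.
summed = sum(estimate(s) for s in sketches)

# Proposed behaviour: merge the sketches first, then estimate once.
merged = sketches[0]
for s in sketches[1:]:
    merged = merge(merged, s)
merged_estimate = estimate(merged)

# Memtable partitions are simply added, with no merging against the sstables,
# so a key present in both places is still counted twice.
memtable_keys = {"key", "other"}
total_estimate = merged_estimate + len(memtable_keys)

print(summed, merged_estimate, total_estimate)
```

Here the summed estimate is 3 while the merged estimate is 1. Adding the two memtable partitions gives 3, one more than the true total of 2, because 'key' lives in both the memtable and the sstables. That is the "relatively minor negative" the description accepts.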