Return-Path:
X-Original-To: apmail-cassandra-commits-archive@www.apache.org
Delivered-To: apmail-cassandra-commits-archive@www.apache.org
Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 4A11D10AF5 for ; Wed, 2 Oct 2013 18:25:50 +0000 (UTC)
Received: (qmail 79542 invoked by uid 500); 2 Oct 2013 18:25:47 -0000
Delivered-To: apmail-cassandra-commits-archive@cassandra.apache.org
Received: (qmail 79401 invoked by uid 500); 2 Oct 2013 18:25:47 -0000
Mailing-List: contact commits-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
List-Help:
List-Unsubscribe:
List-Post:
List-Id:
Reply-To: dev@cassandra.apache.org
Delivered-To: mailing list commits@cassandra.apache.org
Received: (qmail 79168 invoked by uid 99); 2 Oct 2013 18:25:46 -0000
Received: from arcas.apache.org (HELO arcas.apache.org) (140.211.11.28) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 02 Oct 2013 18:25:46 +0000
Date: Wed, 2 Oct 2013 18:25:46 +0000 (UTC)
From: "Tyler Hobbs (JIRA)"
To: commits@cassandra.apache.org
Message-ID:
In-Reply-To:
References:
Subject: [jira] [Updated] (CASSANDRA-1337) parallelize fetching rows for low-cardinality indexes
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394

     [ https://issues.apache.org/jira/browse/CASSANDRA-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tyler Hobbs updated CASSANDRA-1337:
-----------------------------------

    Attachment: 0001-Concurrent-range-and-2ary-index-subqueries.patch

Patch 0001 is a somewhat re-worked version of David's patch against trunk.

Breaking down estimates by the filter type was neglecting all non-compact cql3 queries (and some compact ones).
A simpler way to estimate the number of results is to break down the queries into four groups:
* Secondary index query: use the mean columns from the index CF (one column per data row or cql3 data row)
* Non-cql3 range query: use the estimated keys
* cql3 range query, compact (compact storage or single-component primary key): use the estimated keys
* cql3 range query, non-compact: use (estimated_keys * mean_columns) / (number_of_defined_columns)

The last case is the least accurate. When collections are involved, it will overestimate the number of cql3 rows that will be returned, meaning additional ranges may need to be queried, but I think this is an acceptable degradation of the optimization.

Another change from David's patch is that if the first round fetches too few results, the concurrency factor will be recalculated based on the results we've seen so far instead of simply being set to 1.

I wasn't sure where to add tests for this; would dtests be the best place?

> parallelize fetching rows for low-cardinality indexes
> -----------------------------------------------------
>
>                 Key: CASSANDRA-1337
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1337
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Jonathan Ellis
>            Assignee: Tyler Hobbs
>            Priority: Minor
>             Fix For: 2.1
>
>         Attachments: 0001-Concurrent-range-and-2ary-index-subqueries.patch, 1137-bugfix.patch, 1337.patch, 1337-v4.patch, ASF.LICENSE.NOT.GRANTED--0001-CASSANDRA-1337-scan-concurrently-depending-on-num-rows.txt, CASSANDRA-1337.patch
>
>   Original Estimate: 8h
>  Remaining Estimate: 8h
>
> currently, we read the indexed rows from the first node (in partitioner order); if that does not have enough matching rows, we read the rows from the next, and so forth.
> we should use the statistics from CASSANDRA-1155 to query multiple nodes in parallel, such that we have a high chance of getting enough rows w/o having to do another round of queries (but, if our estimate is incorrect, we do need to loop and do more rounds until we have enough data or we have fetched from each node).



--
This message was sent by Atlassian JIRA
(v6.1#6144)
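
For readers following the estimation scheme in the comment above, here is a rough Python sketch of the four-group estimate and of recalculating the concurrency factor after a short first round. It is not the code in the attached patch (which is Java). Every name here (QueryKind, Stats, estimate_results_per_range, concurrency_factor, recalculated_factor) is hypothetical. The comment does not say how the estimate becomes a concurrency factor or how the recalculation works, so the formulas used here (dividing the limit by the per-range estimate, and using the observed results per range on retry) are assumptions.

```python
import math
from dataclasses import dataclass
from enum import Enum, auto


class QueryKind(Enum):
    """The four query groups named in the comment (enum itself is hypothetical)."""
    SECONDARY_INDEX = auto()
    NON_CQL3_RANGE = auto()
    CQL3_RANGE_COMPACT = auto()
    CQL3_RANGE_NON_COMPACT = auto()


@dataclass
class Stats:
    """Hypothetical stand-in for the per-range statistics the estimate uses."""
    estimated_keys: float = 0.0
    mean_columns: float = 0.0        # mean columns per row in the data CF
    index_mean_columns: float = 0.0  # mean columns per row in the index CF
    defined_columns: int = 1         # number of defined cql3 columns


def estimate_results_per_range(kind, stats):
    """Estimate results per range using the four-group breakdown."""
    if kind is QueryKind.SECONDARY_INDEX:
        # One index column per data row (or cql3 data row).
        return stats.index_mean_columns
    if kind in (QueryKind.NON_CQL3_RANGE, QueryKind.CQL3_RANGE_COMPACT):
        return stats.estimated_keys
    # Non-compact cql3: the least accurate case, overestimates with collections.
    return stats.estimated_keys * stats.mean_columns / stats.defined_columns


def concurrency_factor(limit, results_per_range, total_ranges):
    """Assumed mapping: query just enough ranges in parallel to reach `limit`."""
    if results_per_range <= 0:
        return 1  # assumption: fall back to sequential when there is no estimate
    needed = math.ceil(limit / results_per_range)
    return max(1, min(total_ranges, needed))


def recalculated_factor(limit, fetched, ranges_queried, ranges_left):
    """Assumed retry rule: re-estimate from the results seen so far."""
    if fetched == 0:
        # Assumption: nothing matched yet, so query every remaining range.
        return max(1, ranges_left)
    observed_per_range = fetched / ranges_queried
    return concurrency_factor(limit - fetched, observed_per_range, ranges_left)


stats = Stats(estimated_keys=100, mean_columns=12, defined_columns=4)
per_range = estimate_results_per_range(QueryKind.CQL3_RANGE_NON_COMPACT, stats)
first_round = concurrency_factor(1000, per_range, total_ranges=16)
second_round = recalculated_factor(1000, fetched=400, ranges_queried=4, ranges_left=12)
```

With these made-up numbers the non-compact estimate is 100 * 12 / 4 = 300 rows per range, so the first round queries 4 ranges. If those 4 ranges return only 400 rows, the retry uses the observed 100 rows per range and queries 6 more ranges rather than dropping back to 1.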