Date: Tue, 31 Mar 2015 08:26:54 +0000 (UTC)
From: "Stefania (JIRA)"
To: commits@cassandra.apache.org
Reply-To: dev@cassandra.apache.org
Subject: [jira] [Commented] (CASSANDRA-8180) Optimize disk seek using min/max column name meta data when the LIMIT clause is used

    [ https://issues.apache.org/jira/browse/CASSANDRA-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14388229#comment-14388229 ]

Stefania commented on CASSANDRA-8180:
-------------------------------------

You are correct, a Clusterable was sufficient.

There is one disk access that I am not sure we can remove: the read of the partition level deletion:

{code}
status.mostRecentPartitionTombstone = Math.max(status.mostRecentPartitionTombstone, iter.partitionLevelDeletion().markedForDeleteAt());
{code}

We need to read the partition level deletion in the initial for loop of {{SinglePartitionReadCommand}}, with the sstables ordered by max timestamp, in order to skip older sstables:

{code}
if (sstable.getMaxTimestamp() < status.mostRecentPartitionTombstone)
    break;
{code}

I don't see how we could skip older sstables if they are not ordered by max timestamp and are instead picked lazily by the merge iterator when they become eligible according to their lower bound. As a consequence, I don't know how to postpone reading the partition level deletion.

So far I have some code that still calls {{iter.partitionLevelDeletion()}} in that initial for loop, but no other iterator methods should be called until the sstable is picked by the merge iterator. Can we do better?

Code is here: https://github.com/stef1927/cassandra/tree/8180-8099

> Optimize disk seek using min/max column name meta data when the LIMIT clause is used
> ------------------------------------------------------------------------------------
>
> Key: CASSANDRA-8180
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8180
> Project: Cassandra
> Issue Type: Improvement
> Components: Core
> Environment: Cassandra 2.0.10
> Reporter: DOAN DuyHai
> Assignee: Stefania
> Priority: Minor
> Fix For: 3.0
>
>
> I was working on an example of a sensor data table (timeseries) and faced a use case where C* does not optimize reads on disk.
> {code}
> cqlsh:test> CREATE TABLE test(id int, col int, val text, PRIMARY KEY(id,col)) WITH CLUSTERING ORDER BY (col DESC);
> cqlsh:test> INSERT INTO test(id, col , val ) VALUES ( 1, 10, '10');
> ...
> >nodetool flush test test
> ...
> cqlsh:test> INSERT INTO test(id, col , val ) VALUES ( 1, 20, '20');
> ...
> >nodetool flush test test
> ...
> cqlsh:test> INSERT INTO test(id, col , val ) VALUES ( 1, 30, '30');
> ...
> >nodetool flush test test
> {code}
> After that, I activate request tracing:
> {code}
> cqlsh:test> SELECT * FROM test WHERE id=1 LIMIT 1;
> activity                                                                  | timestamp    | source    | source_elapsed
> --------------------------------------------------------------------------+--------------+-----------+----------------
> execute_cql3_query                                                        | 23:48:46,498 | 127.0.0.1 | 0
> Parsing SELECT * FROM test WHERE id=1 LIMIT 1;                            | 23:48:46,498 | 127.0.0.1 | 74
> Preparing statement                                                       | 23:48:46,499 | 127.0.0.1 | 253
> Executing single-partition query on test                                  | 23:48:46,499 | 127.0.0.1 | 930
> Acquiring sstable references                                              | 23:48:46,499 | 127.0.0.1 | 943
> Merging memtable tombstones                                               | 23:48:46,499 | 127.0.0.1 | 1032
> Key cache hit for sstable 3                                               | 23:48:46,500 | 127.0.0.1 | 1160
> Seeking to partition beginning in data file                               | 23:48:46,500 | 127.0.0.1 | 1173
> Key cache hit for sstable 2                                               | 23:48:46,500 | 127.0.0.1 | 1889
> Seeking to partition beginning in data file                               | 23:48:46,500 | 127.0.0.1 | 1901
> Key cache hit for sstable 1                                               | 23:48:46,501 | 127.0.0.1 | 2373
> Seeking to partition beginning in data file                               | 23:48:46,501 | 127.0.0.1 | 2384
> Skipped 0/3 non-slice-intersecting sstables, included 0 due to tombstones | 23:48:46,501 | 127.0.0.1 | 2768
> Merging data from memtables and 3 sstables                                | 23:48:46,501 | 127.0.0.1 | 2784
> Read 2 live and 0 tombstoned cells                                        | 23:48:46,501 | 127.0.0.1 | 2976
> Request complete                                                          | 23:48:46,501 | 127.0.0.1 | 3551
> {code}
> We can clearly see that C* hits 3 SSTables on disk instead of just one, although it has the min/max column meta data to decide which SSTable contains the most recent data.
> Funny enough, if we add a clause on the clustering column to the select, this time C* optimizes the read path:
> {code}
> cqlsh:test> SELECT * FROM test WHERE id=1 AND col > 25 LIMIT 1;
> activity                                                                  | timestamp    | source    | source_elapsed
> --------------------------------------------------------------------------+--------------+-----------+----------------
> execute_cql3_query                                                        | 23:52:31,888 | 127.0.0.1 | 0
> Parsing SELECT * FROM test WHERE id=1 AND col > 25 LIMIT 1;               | 23:52:31,888 | 127.0.0.1 | 60
> Preparing statement                                                       | 23:52:31,888 | 127.0.0.1 | 277
> Executing single-partition query on test                                  | 23:52:31,889 | 127.0.0.1 | 961
> Acquiring sstable references                                              | 23:52:31,889 | 127.0.0.1 | 971
> Merging memtable tombstones                                               | 23:52:31,889 | 127.0.0.1 | 1020
> Key cache hit for sstable 3                                               | 23:52:31,889 | 127.0.0.1 | 1108
> Seeking to partition beginning in data file                               | 23:52:31,889 | 127.0.0.1 | 1117
> Skipped 2/3 non-slice-intersecting sstables, included 0 due to tombstones | 23:52:31,889 | 127.0.0.1 | 1611
> Merging data from memtables and 1 sstables                                | 23:52:31,890 | 127.0.0.1 | 1624
> Read 1 live and 0 tombstoned cells                                        | 23:52:31,890 | 127.0.0.1 | 1700
> Request complete                                                          | 23:52:31,890 | 127.0.0.1 | 2140
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
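
For readers following the comment at the top of this message, the skipping rule it describes can be sketched in isolation. This is only an illustration under stated assumptions, not the code from the linked branch: {{TombstoneSkipSketch}}, {{SSTableInfo}} and {{selectSSTables}} are hypothetical names, and each sstable is reduced to its max timestamp and the {{markedForDeleteAt}} value of its partition level deletion ({{Long.MIN_VALUE}} standing in for "no deletion"). It shows why the loop relies on the sstables being ordered by max timestamp: once an sstable is older than the most recent partition tombstone seen so far, every later one is too.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class TombstoneSkipSketch {

    /** Hypothetical stand-in for an sstable: only the two values the loop needs. */
    static final class SSTableInfo {
        final String name;
        final long maxTimestamp;        // newest write timestamp stored in the sstable
        final long partitionDeletionAt; // markedForDeleteAt; Long.MIN_VALUE means "no partition deletion"

        SSTableInfo(String name, long maxTimestamp, long partitionDeletionAt) {
            this.name = name;
            this.maxTimestamp = maxTimestamp;
            this.partitionDeletionAt = partitionDeletionAt;
        }
    }

    /** Returns the names of the sstables that still have to be read, newest first. */
    static List<String> selectSSTables(List<SSTableInfo> sstables) {
        // Order by max timestamp, newest first, as the comment describes.
        List<SSTableInfo> ordered = new ArrayList<>(sstables);
        ordered.sort(Comparator.comparingLong((SSTableInfo s) -> s.maxTimestamp).reversed());

        long mostRecentPartitionTombstone = Long.MIN_VALUE;
        List<String> selected = new ArrayList<>();
        for (SSTableInfo sstable : ordered) {
            // Everything in this sstable (and in all later, older ones) is shadowed
            // by a partition tombstone that was already seen, so stop here.
            if (sstable.maxTimestamp < mostRecentPartitionTombstone)
                break;

            // In the real read path, obtaining this value is the disk access
            // that the comment would like to postpone.
            mostRecentPartitionTombstone =
                Math.max(mostRecentPartitionTombstone, sstable.partitionDeletionAt);
            selected.add(sstable.name);
        }
        return selected;
    }

    public static void main(String[] args) {
        List<SSTableInfo> sstables = Arrays.asList(
            new SSTableInfo("sstable-1", 100L, Long.MIN_VALUE),
            new SSTableInfo("sstable-2", 200L, Long.MIN_VALUE),
            new SSTableInfo("sstable-3", 300L, 250L)); // partition deleted at timestamp 250
        System.out.println(selectSSTables(sstables)); // prints [sstable-3]
    }
}
```

In this sketch the early {{break}} is only valid because of the sort. If the sstables were instead handed out lazily in lower-bound order, an sstable with a newer max timestamp could still appear after the loop had stopped, which is the difficulty the comment raises.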