Date: Wed, 2 Mar 2016 14:07:18 +0000 (UTC)
From: "Jonathan Ellis (JIRA)"
To: commits@cassandra.apache.org
Reply-To: dev@cassandra.apache.org
Subject: [jira] [Comment Edited] (CASSANDRA-11206) Support large partitions on the 3.0 sstable format

    [ https://issues.apache.org/jira/browse/CASSANDRA-11206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15175634#comment-15175634 ]

Jonathan Ellis edited comment on CASSANDRA-11206 at 3/2/16 2:06 PM:
--------------------------------------------------------------------

bq.
For partitions < 64k (partitions without an IndexInfo object) we could skip the indirection during reads via RowIndexEntry at all by extending the IndexSummary and directly store the offset into the data file

Since the idea here is to do something simple that we can be confident about shipping in 3.6 if CASSANDRA-9754 isn't ready, let's avoid making changes to the on-disk layout.

To clarify for others following along,

bq. Remove IndexInfo from the key cache (not from the index file on disk, of course)

This sounds scary but it's core to the goal here: if we're going to support large partitions, we can't afford the overhead either of keeping the entire summary on heap, or of reading it from disk in the first place. (If we're reading a 1KB row, then reading 2MB of summary first on a cache miss is a huge overhead.) Moving the key cache off heap (CASSANDRA-9738) would have helped with the first but not the second.

So one approach is to go back to the old strategy of only caching the partition key location, and then go through the index bsearch using the offsets map every time. For small partitions this will be trivial and, I hope, negligible for performance compared with the current cache. (If not, we can look at a hybrid strategy, but I'd like to avoid that complexity if possible.)

bq. what I was thinking was that the key cache instead of storing a copy of the RIE it would store an offset into the index that is the location of the RIE. Then the RIE could be accessed off heap via a memory mapping without doing any allocations or copies

I was thinking that even the offsets alone for a 4GB partition are going to be 256KB, so we don't want to cache the entire offsets map. But the other side of that is that if you have a bunch of 4GB partitions you won't have very many of them. 16TB of data would be 1GB of offsets, which is within reasonable bounds when off heap.
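As a quick sanity check of the sizes quoted above, here is a minimal back-of-envelope sketch. It assumes one 4-byte offset per IndexInfo entry and binary units (1KB = 1024 bytes); neither is stated in the comment, but both are consistent with its numbers. The function and constant names are illustrative only.

```python
# Back-of-envelope check of the offsets-map sizes quoted in the comment.
# Assumptions (not stated in the comment): one 4-byte offset per
# IndexInfo entry, and binary units (1KB = 1024 bytes).
KB, MB, GB, TB = 1024, 1024**2, 1024**3, 1024**4

INDEX_INTERVAL = 64 * KB   # default row-range size covered by one IndexInfo
OFFSET_BYTES = 4           # assumed width of one offsets-map entry


def offsets_map_bytes(partition_bytes):
    """Size of the offsets map for a single partition of the given size."""
    entries = partition_bytes // INDEX_INTERVAL
    return entries * OFFSET_BYTES


per_partition = offsets_map_bytes(4 * GB)   # offsets for one 4GB partition
partitions = (16 * TB) // (4 * GB)          # how many 4GB partitions fit in 16TB
total = partitions * per_partition          # offsets for all of them

print(per_partition // KB, "KB of offsets per 4GB partition")
print(partitions, "partitions in 16TB")
print(total // GB, "GB of offsets in total")
```

Under those assumptions a 4GB partition has 65,536 index entries, which is where the 256KB and 1GB figures come from.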
And your approach may require fewer logic changes than the one above, since we're still "caching" the entire summary, sort of; only adding an extra indirection to read the IndexInfo entries. So that might well be simpler.


was (Author: jbellis):
bq. For partitions < 64k (partitions without an IndexInfo object) we could skip the indirection during reads via RowIndexEntry at all by extending the IndexSummary and directly store the offset into the data file

Since the idea here is to do something simple that we can be confident about shipping in 3.6 if CASSANDRA-9754 isn't ready, let's avoid making changes to the on disk layout, i.e., your Plan B.

To clarify for others following along,

bq. Remove IndexInfo from the key cache (not from the index file on disk, of course)

This sounds scary but it's core to the goal here: if we're going to support large partitions, we can't afford the overhead either of keeping the entire summary on heap, or of reading it from disk in the first place. (If we're reading a 1KB row, then reading 2MB of summary first on a cache miss is a huge overhead.) Moving the key cache off heap (CASSANDRA-9738) would have helped with the first but not the second.

So one approach is to go back to the old strategy of only caching the partition key location, and then go through the index bsearch using the offsets map every time. For small partitions this will be trivial and I hope negligible to the performance story vs the current cache. (If not, we can look at a hybrid strategy, but I'd like to avoid that complexity if possible.)

bq. what I was thinking was that the key cache instead of storing a copy of the RIE it would store an offset into the index that is the location of the RIE. Then the RIE could be accessed off heap via a memory mapping without doing any allocations or copies

I was thinking that even the offsets alone for a 4GB partition are going to be 256KB, so we don't want to cache the entire offsets map.
But the other side there is that if you have a bunch of 4GB partitions you won't have very many of them. 16TB of data would be 1GB of offsets which is within the bounds of reasonable when off heap. And your approach may require less logic changes than the one above, since we're still "caching" the entire summary, sort of; only adding an extra indirection to read the IndexInfo entries. So that might well be simpler.


> Support large partitions on the 3.0 sstable format
> --------------------------------------------------
>
>                 Key: CASSANDRA-11206
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11206
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Jonathan Ellis
>            Assignee: Robert Stupp
>             Fix For: 3.x
>
>
> Cassandra saves a sample of IndexInfo objects that store the offset within each partition of every 64KB (by default) range of rows. To find a row, we binary search this sample, then scan the partition of the appropriate range.
> The problem is that this scales poorly as partitions grow: on a cache miss, we deserialize the entire set of IndexInfo, which both creates a lot of GC overhead (as noted in CASSANDRA-9754) but is also non-negligible i/o activity (relative to reading a single 64KB row range) as partitions get truly large.
> We introduced an "offset map" in CASSANDRA-10314 that allows us to perform the IndexInfo bsearch while only deserializing IndexInfo that we need to compare against, i.e. log(N) deserializations.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
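The lookup described in the issue text (binary search over serialized IndexInfo, deserializing only the entries compared against) can be sketched roughly as follows. This is an illustration only, not Cassandra's implementation: the entry layout (`first_row`, `data_offset` as two 8-byte integers), `build_index`, and `find_block` are all hypothetical. Real IndexInfo entries are variable-width, which is why a separate offsets map is needed to locate entry `i`; fixed-width entries are used here only to keep the sketch short.

```python
import struct

# Hypothetical serialized IndexInfo: (first_row, data_offset), two signed
# 64-bit big-endian integers. Not Cassandra's actual format.
ENTRY = struct.Struct(">qq")


def build_index(entries):
    """Serialize entries back to back and record where each one starts."""
    blob = bytearray()
    offsets = []
    for first_row, data_offset in entries:
        offsets.append(len(blob))
        blob += ENTRY.pack(first_row, data_offset)
    return bytes(blob), offsets


def find_block(blob, offsets, row_key):
    """Binary search for the last entry whose first_row <= row_key.

    Only the entries actually compared against are deserialized.
    Returns (data_offset or None, number of entries deserialized).
    """
    lo, hi = 0, len(offsets) - 1
    found, reads = None, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        first_row, data_offset = ENTRY.unpack_from(blob, offsets[mid])
        reads += 1
        if first_row <= row_key:
            found = data_offset   # candidate block; look for a later one
            lo = mid + 1
        else:
            hi = mid - 1
    return found, reads


# 65,536 entries, one per 64KB block of a 4GB partition. Each block is
# assumed (for the example) to start at a row key that is a multiple of 100.
entries = [(i * 100, i * 64 * 1024) for i in range(65536)]
blob, offsets = build_index(entries)
data_offset, reads = find_block(blob, offsets, 1234567)
print("block starts at byte", data_offset, "after", reads, "deserializations")
```

With 65,536 entries the search deserializes at most 17 of them, rather than the whole set.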