Message-ID: <2457897.103071291418294776.JavaMail.jira@thor>
Date: Fri, 3 Dec 2010 18:18:14 -0500 (EST)
From: "Ryan King (JIRA)"
To: commits@cassandra.apache.org
Reply-To: dev@cassandra.apache.org
Subject: [jira] Updated: (CASSANDRA-1555) Considerations for larger bloom filters

     [ https://issues.apache.org/jira/browse/CASSANDRA-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ryan King updated CASSANDRA-1555:
---------------------------------

    Attachment: CASSANDRA-1555v2.patch

Here's a patch that takes a better
approach: it uses the SSTable version to tell which type of bloom filter to use.

In order to make this work I had to do some refactoring in the Iterators. There were a number of places where we were passing around CFMetaData objects, where an SSTableReader would be better because it would allow us to get at the Descriptor for that table. AFAICT all the call points had an SSTableReader available, so this refactoring was not very intrusive.

The new thing is the BigBloomFilter, which uses OpenBitSet and LongMurmurHash. Some of it is copy/paste from BloomFilter.

All unit and system tests pass, but this could use some more testing, for sure, especially around the upgrade path. Also, the LongMurmurHash seems to have more collisions. I'll see if I can figure out why.

One other note: FilterTest became FilterTestHelper because it no longer has any test methods of its own.

> Considerations for larger bloom filters
> ---------------------------------------
>
>                 Key: CASSANDRA-1555
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1555
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Stu Hood
>            Assignee: Ryan King
>             Fix For: 0.8
>
>         Attachments: cassandra-1555.tgz, CASSANDRA-1555v2.patch
>
>
> To (optimally) support SSTables larger than 143 million keys, we need to support bloom filters larger than 2^31 bits, which java.util.BitSet can't handle directly.
> A few options:
> * Switch to a BitSet class which supports 2^31 * 64 bits (Lucene's OpenBitSet)
> * Partition the java.util.BitSet behind our current BloomFilter
> ** Straightforward bit partitioning: bit N is in bitset N // 2^31
> ** Separate equally sized complete bloom filters for member ranges, which can be used independently or OR'd together under memory pressure.
> All of these options require new approaches to serialization.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
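
For reference, the "straightforward bit partitioning" option from the issue description (bit N is in bitset N // 2^31) could look roughly like the sketch below. This is not code from the attached patch, which takes the OpenBitSet route instead; the class name PartitionedBitSet and the pageShift parameter are hypothetical and exist only for illustration. A pageShift of 31 corresponds to the 2^31-bit partitions mentioned in the description.

```java
import java.util.BitSet;

/**
 * Hypothetical sketch: a long-indexed bit set backed by several
 * java.util.BitSet pages. Bit N lives in page N >>> pageShift at
 * offset N & (2^pageShift - 1).
 */
final class PartitionedBitSet {
    private final int pageShift;   // log2 of the bits per page (assumption: 1..31)
    private final long pageMask;   // selects the offset within a page
    private final long capacity;   // total number of addressable bits
    private final BitSet[] pages;

    PartitionedBitSet(long numBits, int pageShift) {
        if (numBits <= 0 || pageShift < 1 || pageShift > 31) {
            throw new IllegalArgumentException("numBits must be > 0 and pageShift in [1, 31]");
        }
        long pageCount = ((numBits - 1) >>> pageShift) + 1;
        if (pageCount > Integer.MAX_VALUE) {
            throw new IllegalArgumentException("too many pages for this pageShift");
        }
        this.pageShift = pageShift;
        this.pageMask = (1L << pageShift) - 1;
        this.capacity = numBits;
        this.pages = new BitSet[(int) pageCount];
        for (int i = 0; i < pages.length; i++) {
            pages[i] = new BitSet(); // each page grows on demand as bits are set
        }
    }

    void set(long index) {
        check(index);
        pages[(int) (index >>> pageShift)].set((int) (index & pageMask));
    }

    boolean get(long index) {
        check(index);
        return pages[(int) (index >>> pageShift)].get((int) (index & pageMask));
    }

    long capacity() {
        return capacity;
    }

    int pageCount() {
        return pages.length;
    }

    private void check(long index) {
        if (index < 0 || index >= capacity) {
            throw new IndexOutOfBoundsException("bit index out of range: " + index);
        }
    }
}
```

With new PartitionedBitSet(1L << 33, 31), a call to set((1L << 31) + 5) lands in page 1 at offset 5. Serialization is left out on purpose: as the description notes, any of these layouts needs a new on-disk format.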