Date: Wed, 20 Feb 2013 02:25:12 +0000 (UTC)
From: "clockfly (JIRA)"
To: dev@hbase.apache.org
Subject: [jira] [Created] (HBASE-7885) bloom filter compaction is too aggressive for Hfile which only contains small count of records

clockfly created HBASE-7885:
-------------------------------

             Summary: bloom filter compaction is too aggressive for Hfile which only contains small count of records
                 Key: HBASE-7885
                 URL: https://issues.apache.org/jira/browse/HBASE-7885
             Project: HBase
          Issue Type: Bug
          Components: Performance, Scanners
    Affects Versions: 0.94.5
            Reporter: clockfly
            Priority: Minor
             Fix For: 0.94.5

For HFile V2, the bloom filter starts with an initial size of 128KB. When not many records are inserted into the bloom filter, it shrinks itself through compaction.
For example, starting from 128K, it compacts to 64K -> 32K -> 16K -> 8K -> 4K -> 2K -> 1K -> 512 -> 256 -> 128 -> 64 -> 32, as long as it thinks the result is still bounded by the estimated error rate.

If we put only a few records in the HFile, the bloom filter is compacted too far, which breaks the assumption that the shrunken filter is still bounded by the estimated error rate. The false positive rate becomes unacceptably high.

For example, if we set the expected error rate to 0.00001, then for 10 records the bloom filter is 64 bytes after compaction, and the real effective false positive rate is 50%.

The use case is this: if we use HBase to store big records such as images and binaries, each record takes megabytes, so a 128M file contains only dozens of records.

The suggested fix is to set a lower limit for the bloom filter compaction process. I suggest 1000 bytes.
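
The suggested lower limit can be sketched roughly as follows. This is only an illustration, not HBase's actual compaction code: the function name shrink_with_floor, its parameters, the stopping rule (keep halving while the remaining key capacity is more than twice the number of keys stored), and the capacity of 40000 keys for a 128KB filter are all assumptions made for the example.

```python
def shrink_with_floor(byte_size, max_keys, key_count, min_bytes=0):
    """Halve a bloom filter's size while it still has spare key capacity.

    Hypothetical sketch. The stopping rule and every name here are
    assumptions for illustration; they are not taken from HBase.
    """
    while (
        byte_size % 2 == 0                # size can be split into two halves
        and max_keys > 2 * key_count      # halved filter still "fits" the keys
        and byte_size // 2 >= min_bytes   # proposed floor on the shrunken size
    ):
        byte_size //= 2
        max_keys //= 2
    return byte_size


# Assumed inputs: a 128KB filter with room for 40000 keys, holding 10 keys.
print(shrink_with_floor(128 * 1024, 40000, 10))        # no floor: 64
print(shrink_with_floor(128 * 1024, 40000, 10, 1000))  # 1000-byte floor: 1024
```

Because the sizes here are powers of two, a floor of 1000 bytes makes the shrinking stop at 1024 bytes, the last size that is still at or above the limit.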