Date: Tue, 26 Jun 2012 19:43:42 +0000 (UTC)
From: "Todd Lipcon (JIRA)"
To: dev@accumulo.apache.org
Reply-To: dev@accumulo.apache.org
Message-ID: <1949806241.58071.1340739823719.JavaMail.jiratomcat@issues-vm>
In-Reply-To: <1998994144.57636.1340735024185.JavaMail.jiratomcat@issues-vm>
Subject: [jira] [Commented] (ACCUMULO-652) support block-based filtering within RFile

    [ https://issues.apache.org/jira/browse/ACCUMULO-652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13401642#comment-13401642 ]

Todd Lipcon commented on ACCUMULO-652:
--------------------------------------

I tried to think through some similar techniques in HBASE-6014. Some of the comments there might be useful for working on this JIRA.

> support block-based filtering within RFile
> ------------------------------------------
>
>                 Key: ACCUMULO-652
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-652
>             Project: Accumulo
>          Issue Type: Bug
>            Reporter: Adam Fuchs
>            Assignee: Adam Fuchs
>
> If we keep some stats about what is in an RFile block, we might be able to efficiently [O(log N)], with high probability, implement filters that currently require linear table scans. Two use cases of this include timestamp range filtering (e.g. give me everything from last Tuesday) and cell-level security filtering (e.g. give me everything that I can see with my authorizations).
> For the timestamp range filter, we can keep minimum and maximum timestamps across all keys used in a block within the index entry for that block. For the cell-level security filter, we can keep an aggregate label. This could be done using a simplified disjunction of all of the labels in the block. The extra block statistics information can propagate up the index hierarchy as well, giving nice performance characteristics for finding the next matching entry in a file.
> In general, this is a heuristic technique that is good if data tends to naturally cluster in blocks with respect to the way it is queried. Testing its efficacy will require closely emulating real-world use cases -- tests like the continuous ingest test will not be sufficient. We will have to test for a few things:
> # The cost of storing the extra stats in the index is not too high.
> # The performance benefit for common use cases is significant.
> # We don't introduce any unacceptable worst-case behavior, like bloating the index to ridiculous proportions for any data set.
> Eventually this will all need to be exposed through the Iterator API to be useful, which will be another ticket.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
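
The timestamp range filter described in the issue could be sketched roughly as follows. This is only an illustration of the idea, not RFile code: the class BlockStatsSketch, the IndexNode type, and the methods leaf, inner, mayContain and candidateBlocks are all hypothetical names made up for this example. Each leaf index entry keeps the minimum and maximum timestamp of the keys in its block, inner entries take the min/max over their children, and a range query skips any subtree whose interval cannot overlap the requested range. The cell-level security case (an aggregate label per block) is not covered here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: none of these names come from Accumulo's RFile code.
final class BlockStatsSketch {

    /** An index entry: a leaf pointing at one data block, or an inner node with children. */
    static final class IndexNode {
        final long minTs;               // smallest timestamp of any key underneath
        final long maxTs;               // largest timestamp of any key underneath
        final int blockId;              // data block id for leaves, -1 for inner nodes
        final List<IndexNode> children; // empty for leaves

        private IndexNode(long minTs, long maxTs, int blockId, List<IndexNode> children) {
            this.minTs = minTs;
            this.maxTs = maxTs;
            this.blockId = blockId;
            this.children = children;
        }

        /** Leaf entry: stats are computed from the timestamps of the keys in one block. */
        static IndexNode leaf(int blockId, long... timestamps) {
            long min = Long.MAX_VALUE;
            long max = Long.MIN_VALUE;
            for (long ts : timestamps) {
                min = Math.min(min, ts);
                max = Math.max(max, ts);
            }
            return new IndexNode(min, max, blockId, List.of());
        }

        /** Inner entry: stats propagate up as the min/max over the children. */
        static IndexNode inner(IndexNode... kids) {
            long min = Long.MAX_VALUE;
            long max = Long.MIN_VALUE;
            for (IndexNode kid : kids) {
                min = Math.min(min, kid.minTs);
                max = Math.max(max, kid.maxTs);
            }
            return new IndexNode(min, max, -1, List.of(kids));
        }

        /** True if [minTs, maxTs] overlaps the inclusive query range [from, to]. */
        boolean mayContain(long from, long to) {
            return minTs <= to && maxTs >= from;
        }
    }

    /** Returns ids of blocks that might hold keys with a timestamp in [from, to]. */
    static List<Integer> candidateBlocks(IndexNode root, long from, long to) {
        List<Integer> out = new ArrayList<>();
        collect(root, from, to, out);
        return out;
    }

    private static void collect(IndexNode node, long from, long to, List<Integer> out) {
        if (!node.mayContain(from, to)) {
            return; // prune: nothing under this entry can match
        }
        if (node.children.isEmpty()) {
            out.add(node.blockId); // leaf: the block still has to be scanned and filtered
            return;
        }
        for (IndexNode child : node.children) {
            collect(child, from, to, out);
        }
    }

    public static void main(String[] args) {
        IndexNode root = IndexNode.inner(
            IndexNode.inner(IndexNode.leaf(0, 100, 120), IndexNode.leaf(1, 130, 180)),
            IndexNode.inner(IndexNode.leaf(2, 200, 260), IndexNode.leaf(3, 900, 950)));
        System.out.println(candidateBlocks(root, 150, 250)); // prints [1, 2]
    }
}
```

The stats only rule blocks out; a returned block may still contain no matching key, which is why the issue calls this a heuristic. In the sample data, a query for [300, 800] descends into the second inner entry (its interval is [200, 950]) and then rejects both of its leaves, so pruning helps most when timestamps cluster by block.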