accumulo-notifications mailing list archives

From "Christopher Tubbs (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (ACCUMULO-652) support block-based filtering within RFile
Date Thu, 22 Jan 2015 20:52:35 GMT

    [ https://issues.apache.org/jira/browse/ACCUMULO-652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14288191#comment-14288191 ]

Christopher Tubbs commented on ACCUMULO-652:
--------------------------------------------

This is still open. What are your plans for merging/closing this for 1.7.0?

> support block-based filtering within RFile
> ------------------------------------------
>
>                 Key: ACCUMULO-652
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-652
>             Project: Accumulo
>          Issue Type: Improvement
>          Components: tserver
>            Reporter: Adam Fuchs
>            Assignee: Adam Fuchs
>             Fix For: 1.7.0
>
>          Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> If we keep some stats about what is in an RFile block, we might be able to implement,
> efficiently [O(log N)] and with high probability, filters that currently require linear
> table scans. Two use cases are timestamp range filtering (e.g. give me everything from
> last Tuesday) and cell-level security filtering (e.g. give me everything that I can see
> with my authorizations).
> For the timestamp range filter, we can keep minimum and maximum timestamps across all
> keys used in a block within the index entry for that block. For the cell-level security
> filter, we can keep an aggregate label. This could be done using a simplified disjunction
> of all of the labels in the block. The extra block statistics information can propagate
> up the index hierarchy as well, giving nice performance characteristics for finding the
> next matching entry in a file.
> In general, this is a heuristic technique that is good if data tends to naturally cluster
> in blocks with respect to the way it is queried. Testing its efficacy will require closely
> emulating real-world use cases -- tests like the continuous ingest test will not be
> sufficient. We will have to test for a few things:
> # The cost of storing the extra stats in the index is not too high.
> # The performance benefit for common use cases is significant.
> # We don't introduce any unacceptable worst-case behavior, like bloating the index
> to ridiculous proportions for some data set.
> Eventually this will all need to be exposed through the Iterator API to be useful, which
> will be another ticket.
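
The timestamp part of the description above could be sketched roughly as follows. This is only an illustration under assumptions, not RFile's actual index format: the class TimestampIndexSketch, the IndexEntry type, and the candidateBlocks method are hypothetical names invented here. Each leaf entry stores the min/max timestamp of its data block, each inner entry stores the min/max over its children, and a range query skips any subtree whose range cannot overlap the query.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch: min/max timestamp stats carried in index entries. */
final class TimestampIndexSketch {

    /** An index entry: a leaf (one data block) or an inner node over child entries. */
    static final class IndexEntry {
        final long minTs;
        final long maxTs;
        final List<IndexEntry> children; // empty for a leaf
        final int blockId;               // only meaningful for a leaf

        private IndexEntry(long minTs, long maxTs, List<IndexEntry> children, int blockId) {
            this.minTs = minTs;
            this.maxTs = maxTs;
            this.children = children;
            this.blockId = blockId;
        }

        /** Builds a leaf entry from the timestamps of the keys in one block. */
        static IndexEntry leaf(int blockId, long... timestamps) {
            long min = Long.MAX_VALUE;
            long max = Long.MIN_VALUE;
            for (long ts : timestamps) {
                min = Math.min(min, ts);
                max = Math.max(max, ts);
            }
            return new IndexEntry(min, max, Collections.<IndexEntry>emptyList(), blockId);
        }

        /** Builds an inner entry; the stats propagate up from the children. */
        static IndexEntry inner(IndexEntry... kids) {
            long min = Long.MAX_VALUE;
            long max = Long.MIN_VALUE;
            for (IndexEntry k : kids) {
                min = Math.min(min, k.minTs);
                max = Math.max(max, k.maxTs);
            }
            return new IndexEntry(min, max, Arrays.asList(kids), -1);
        }

        /** True if some key under this entry might fall in [lo, hi]. */
        boolean mayContain(long lo, long hi) {
            return maxTs >= lo && minTs <= hi;
        }
    }

    /** Returns ids of blocks that still have to be read for the range [lo, hi]. */
    static List<Integer> candidateBlocks(IndexEntry root, long lo, long hi) {
        List<Integer> out = new ArrayList<>();
        collect(root, lo, hi, out);
        return out;
    }

    private static void collect(IndexEntry e, long lo, long hi, List<Integer> out) {
        if (!e.mayContain(lo, hi)) {
            return; // the whole subtree is skipped without reading any block
        }
        if (e.children.isEmpty()) {
            out.add(e.blockId);
            return;
        }
        for (IndexEntry child : e.children) {
            collect(child, lo, hi, out);
        }
    }
}
```

A block that passes the check may still hold no matching key (its range only brackets the query), so the filter is a heuristic: it pays off when timestamps cluster within blocks, and degrades toward a full scan when every block spans a wide time range.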

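For the aggregate label, one possible reading of "a simplified disjunction of all of the labels in the block" is sketched below. It assumes each label has already been normalized into a set of AND-clauses (for example, A&B|C becomes {A,B} and {C}); it does not use Accumulo's ColumnVisibility class, and LabelAggregateSketch, aggregate, and mayBeVisible are hypothetical names. The simplification shown is plain absorption: a clause is dropped when another clause in the aggregate is a subset of it.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch: per-block aggregate of visibility labels. */
final class LabelAggregateSketch {

    /**
     * ORs together all labels in a block. Each label is a set of AND-clauses;
     * each clause is a set of authorization tokens. An empty clause means
     * "visible to everyone".
     */
    static Set<Set<String>> aggregate(List<Set<Set<String>>> labelsInBlock) {
        Set<Set<String>> agg = new HashSet<>();
        for (Set<Set<String>> label : labelsInBlock) {
            for (Set<String> clause : label) {
                addWithAbsorption(agg, clause);
            }
        }
        return agg;
    }

    private static void addWithAbsorption(Set<Set<String>> agg, Set<String> clause) {
        // If an existing clause is a subset of the new one, the new one adds nothing.
        for (Set<String> existing : agg) {
            if (clause.containsAll(existing)) {
                return;
            }
        }
        // Otherwise drop existing clauses that the new, weaker clause absorbs.
        agg.removeIf(existing -> existing.containsAll(clause));
        agg.add(new HashSet<>(clause));
    }

    /** True if a reader holding auths might see at least one entry in the block. */
    static boolean mayBeVisible(Set<Set<String>> agg, Set<String> auths) {
        for (Set<String> clause : agg) {
            if (auths.containsAll(clause)) {
                return true;
            }
        }
        return false; // no clause is satisfied, so the block can be skipped
    }
}
```

Aggregates built this way can be combined again with the same aggregate call one level up the index, which is one way the statistics could propagate through the hierarchy. Whether the aggregate stays small depends on how many distinct labels a block holds, which ties back to the index-bloat concern in the test list above.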


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
