hbase-issues mailing list archives

From "Lars Hofhansl (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-14509) Configurable sparse indexes?
Date Sat, 03 Oct 2015 04:34:27 GMT

    [ https://issues.apache.org/jira/browse/HBASE-14509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14942101#comment-14942101 ]

Lars Hofhansl commented on HBASE-14509:

The other part that's needed is to actually make use of the index, i.e. to filter out the HFiles
that do not contain the range of values requested.
Coprocessors won't work here; they operate at the region level, which is a level too high for this.

We could add a method to Filter, which is passed an HFile or a FileInfo or something, and
based on that gets to decide whether to include the HFile or not. Thoughts [~apurtell], [~stack]?
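
To make the idea concrete, here is a minimal sketch of such a per-file check. It is written in Python for brevity and is not the real HBase Filter API: `HFileStats`, `filter_hfile`, and the stats layout are hypothetical names, and values are assumed to compare as raw bytes.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class HFileStats:
    """Hypothetical per-HFile metadata: min/max value seen for each tracked CQ."""
    name: str
    cq_ranges: Dict[bytes, Tuple[bytes, bytes]] = field(default_factory=dict)


def filter_hfile(stats: HFileStats, cq: bytes, lo: bytes, hi: bytes) -> bool:
    """Return True if the HFile must be scanned for values of `cq` in [lo, hi]."""
    tracked = stats.cq_ranges.get(cq)
    if tracked is None:
        # No stats recorded for this CQ, so the file cannot be ruled out.
        return True
    file_min, file_max = tracked
    # Include the file only if [file_min, file_max] overlaps [lo, hi].
    return file_min <= hi and lo <= file_max


files = [
    HFileStats("hfile-a", {b"price": (b"0010", b"0200")}),
    HFileStats("hfile-b", {b"price": (b"0500", b"0900")}),
    HFileStats("hfile-c"),  # no stats recorded for this file
]
to_scan = [f.name for f in files if filter_hfile(f, b"price", b"0600", b"0700")]
```

A file without stats for the requested CQ is always included, so the check can only prune files, never drop results.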

The other question is whether an HFile is too large a unit. Assuming CQ values are all over
the place, storing min/max per HFile would not be very selective (i.e. a large HFile will likely
contain both a very small and a very large value for a specific CQ). So maybe we need to record the min/max
CQ value for a range of keys. I.e. we have a mapping from (key1, key2) -> (min CQ, max
CQ); then as we scan, we skip ahead to the next key1 if we find the value range does not contain
the value we're looking for.
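
A rough sketch of that skip-scan, assuming sorted rows and a sorted, non-overlapping list of key ranges. `RangeStats` and `scan_with_skips` are hypothetical names for illustration, and the in-memory lists stand in for whatever the real storage would be.

```python
from bisect import bisect_left
from typing import List, NamedTuple, Tuple


class RangeStats(NamedTuple):
    """Hypothetical index entry: [start_key, end_key) -> (min CQ value, max CQ value)."""
    start_key: bytes
    end_key: bytes
    min_val: bytes
    max_val: bytes


def scan_with_skips(
    rows: List[Tuple[bytes, bytes]], index: List[RangeStats], wanted: bytes
) -> Tuple[List[bytes], int]:
    """Scan sorted (key, value) rows, skipping key ranges that cannot hold `wanted`.

    Returns the matching row keys and the number of rows actually examined.
    """
    keys = [k for k, _ in rows]
    hits: List[bytes] = []
    examined = 0
    for entry in index:
        if not (entry.min_val <= wanted <= entry.max_val):
            continue  # value range excludes `wanted`: jump to the next range
        start = bisect_left(keys, entry.start_key)
        end = bisect_left(keys, entry.end_key)
        for key, value in rows[start:end]:
            examined += 1
            if value == wanted:
                hits.append(key)
    return hits, examined


rows = [(b"r1", b"05"), (b"r2", b"09"), (b"r3", b"40"),
        (b"r4", b"55"), (b"r5", b"07"), (b"r6", b"08")]
index = [
    RangeStats(b"r1", b"r3", b"05", b"09"),
    RangeStats(b"r3", b"r5", b"40", b"55"),
    RangeStats(b"r5", b"r7", b"07", b"08"),
]
hits, examined = scan_with_skips(rows, index, b"55")
```

Here only the middle range is read, so two of the six rows are examined. Smaller ranges make the index more selective but larger, which is the trade-off being discussed.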

Lastly, we need to indicate at compaction time which CQs to keep track of. That would be a sly
introduction of (some) schema.
We can try to automate that, but we can't keep track of all of them; there might be many, or
the CQ values might be very large.
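
As an illustration only, recording min/max for a configured set of CQs while cells stream by (as they would during a compaction) could look like the sketch below. `build_sparse_index`, the `(row, CQ, value)` cell tuple, and the fixed cells-per-range cutoff are all assumptions made for the example; a real implementation would more likely cut ranges by size and on row boundaries.

```python
from typing import Dict, Iterable, List, Set, Tuple

Cell = Tuple[bytes, bytes, bytes]  # hypothetical cell: (row key, CQ, value)
IndexEntry = Tuple[bytes, bytes, Dict[bytes, Tuple[bytes, bytes]]]


def build_sparse_index(
    cells: Iterable[Cell], tracked_cqs: Set[bytes], cells_per_range: int
) -> List[IndexEntry]:
    """Build (first key, last key, {cq: (min, max)}) entries over sorted cells.

    Only CQs named in `tracked_cqs` are recorded; all others are ignored.
    """
    index: List[IndexEntry] = []
    stats: Dict[bytes, Tuple[bytes, bytes]] = {}
    first_key = last_key = b""
    count = 0
    for row, cq, value in cells:
        if count == 0:
            first_key = row  # this cell opens a new range
        last_key = row
        count += 1
        if cq in tracked_cqs:
            lo, hi = stats.get(cq, (value, value))
            stats[cq] = (min(lo, value), max(hi, value))
        if count == cells_per_range:
            index.append((first_key, last_key, stats))
            stats, count = {}, 0
    if count:
        index.append((first_key, last_key, stats))  # flush the partial last range
    return index


cells = [
    (b"r1", b"price", b"0100"), (b"r1", b"note", b"zzz"),
    (b"r2", b"price", b"0030"), (b"r3", b"price", b"0700"),
    (b"r3", b"note", b"aaa"),
]
sparse_index = build_sparse_index(cells, {b"price"}, cells_per_range=3)
```

Restricting the index to an explicit set of CQs is what keeps it small, and it is also where the bit of schema sneaks in.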

Or we punt and just add the building blocks: add the API I mentioned to Filter, and allow coprocessors
to record and add stuff to the HFile trailer. Then higher-level tools like Phoenix can
add the appropriate logic.

> Configurable sparse indexes?
> ----------------------------
>                 Key: HBASE-14509
>                 URL: https://issues.apache.org/jira/browse/HBASE-14509
>             Project: HBase
>          Issue Type: Brainstorming
>            Reporter: Lars Hofhansl
> This idea just popped up today and I wanted to record it for discussion:
> What if we kept sparse column indexes per region or HFile or per configurable range?
> I.e. For any given CQ we record the lowest and highest value for a particular range (HFile,
Region, or a custom range like the Phoenix guide post).
> By tweaking the size of these ranges we can control the size of the index, vs its selectivity.
> For example if we kept it by HFile we can almost instantly decide whether we need to scan
a particular HFile at all to find a particular value in a Cell.
> We can also collect min/max values for each n MB of data, for example when we scan the
region the first time. Assuming ranges are large enough we can always keep the index in memory
together with the region.
> Kind of a sparse local index. Might be much easier than the buddy region stuff we've been

This message was sent by Atlassian JIRA
