hbase-issues mailing list archives

From "ryan rawson (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HBASE-2794) ROWCOL bloom filter not used if multiple columns within same family are requested in a Get
Date Mon, 12 Jul 2010 20:32:57 GMT

    [ https://issues.apache.org/jira/browse/HBASE-2794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12887511#action_12887511 ]

ryan rawson commented on HBASE-2794:

Consider a table with 12 billion rows. At 9 bits/row, we are looking
at 13,500,000,000 bytes of RAM as a baseline to hold the blooms, which
is about 12.57 GB.  That memory competes with the block cache, so you
are losing 12.57 GB of RAM that could otherwise be used to cache
blocks.  If your data is in the block cache, seeking is essentially
free, so there is an essential trade-off here.
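
The arithmetic above can be reproduced with a short sketch. The class and method names below are hypothetical and are not part of HBase; the only inputs are the 12 billion rows and 9 bits/row quoted in the comment, and "GB" is read as 2^30 bytes, which is what yields the 12.57 figure.

```java
// Back-of-the-envelope estimate of bloom filter memory.
// Hypothetical helper for illustration only; not an HBase API.
final class BloomMemoryEstimate {
    /** Bytes needed to hold `entries` keys at `bitsPerEntry` bits each, rounded up to whole bytes. */
    static long bloomBytes(long entries, int bitsPerEntry) {
        return (entries * bitsPerEntry + 7) / 8;
    }

    /** Converts a byte count to binary gigabytes (2^30 bytes). */
    static double toGiB(long bytes) {
        return bytes / (1024.0 * 1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        long bytes = bloomBytes(12_000_000_000L, 9);
        System.out.printf("%d bytes, about %.2f GiB%n", bytes, toGiB(bytes));
    }
}
```

Changing the row count or the bits/row in `main` gives a quick sense of how much block cache a bloom of a given size would displace on another data set.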

In my case, the 12 billion rows are small ones, so there are a lot of
rows for the actual data size.  On a different dataset, the row count
might be smaller for the same data size, and blooms might be
worthwhile.  Furthermore, blooms help only with Gets, not with Scans.

The key takeaway here is that (a) bloom filters are not free and
potentially very expensive in terms of RAM, (b) bloom data competes
with the block cache, and (c) the trade-off depends on the data set
and access patterns.

On Mon, Jul 12, 2010 at 12:07 PM, HBase Review Board (JIRA)

> ROWCOL bloom filter not used if multiple columns within same family are requested in a Get
> ------------------------------------------------------------------------------------------
>                 Key: HBASE-2794
>                 URL: https://issues.apache.org/jira/browse/HBASE-2794
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Kannan Muthukkaruppan
> Noticed the following snippet in StoreFile.java:Scanner:shouldSeek():
> {code}
>         switch(bloomFilterType) {
>           case ROW:
>             key = row;
>             break;
>           case ROWCOL:
>             if (columns.size() == 1) {
>               byte[] col = columns.first();
>               key = Bytes.add(row, col);
>               break;
>             }
>             //$FALL-THROUGH$
>           default:
>             return true;
>         }
> {code}
> If columns.size() > 1, then we currently don't take advantage of the bloom filter.
> We should optimize this to check the bloom for each of the columns and, if none of the
> columns is present in the bloom, avoid opening the file.
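
The optimization proposed in the issue could look roughly like the sketch below. This is not the patch that was committed for HBASE-2794; `KeyFilter`, `exactFilterOf`, `rowCol`, and this `shouldSeek` signature are hypothetical stand-ins so the example is self-contained. The idea is only that the file can be skipped when every requested row+column key is reported absent, and must be opened as soon as any one key might be present.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of a per-column ROWCOL bloom check. All names here are hypothetical.
final class RowColBloomSketch {
    /** Stand-in for a bloom filter: false positives are allowed, false negatives are not. */
    interface KeyFilter {
        boolean mightContain(byte[] key);
    }

    /** Exact, set-backed filter used only to exercise the logic (a real bloom is probabilistic). */
    static KeyFilter exactFilterOf(byte[]... keys) {
        Set<String> present = new HashSet<>();
        for (byte[] k : keys) {
            // ISO-8859-1 maps each byte to one char, so this encoding is lossless.
            present.add(new String(k, StandardCharsets.ISO_8859_1));
        }
        return key -> present.contains(new String(key, StandardCharsets.ISO_8859_1));
    }

    /** Concatenates row and column, mirroring Bytes.add(row, col) in the quoted snippet. */
    static byte[] rowCol(byte[] row, byte[] col) {
        byte[] key = Arrays.copyOf(row, row.length + col.length);
        System.arraycopy(col, 0, key, row.length, col.length);
        return key;
    }

    /** Returns false only when the filter rules out every requested column for this row. */
    static boolean shouldSeek(KeyFilter bloom, byte[] row, List<byte[]> columns) {
        if (columns.isEmpty()) {
            return true; // no explicit columns, so a ROWCOL filter cannot be consulted
        }
        for (byte[] col : columns) {
            if (bloom.mightContain(rowCol(row, col))) {
                return true; // at least one column may be in this file
            }
        }
        return false; // none of the columns can be in this file
    }
}
```

Each requested column costs one extra bloom lookup, so for a Get with many columns the check is no longer constant-time; whether that is worthwhile depends on how often it lets a file be skipped.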

