hbase-issues mailing list archives

From "Sergey Shelukhin (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.
Date Wed, 19 Dec 2012 22:47:15 GMT

    [ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13536507#comment-13536507 ]

Sergey Shelukhin commented on HBASE-5416:

bq. What is happening in SingleColumnValueExcludeFilter? We are removing filterKeyValue and
putting in place filterRow and hasFilterRow?
Max commented above:
"...I resolved this by checking that the row is not empty right before filterRow(List) is called,
but this requires slightly modifying the SingleColumnValueExcludeFilter logic: moving the exclude phase
from the filterKeyValue method to filterRow(List). The main reason for this is that there is
no way to distinguish, at the RegionScanner::nextInternal level, a row which is empty because
the filter accepts the row but excludes all its KVs from a row which is empty because the filter rejects it."

bq. Should filterBase do return filter.isFamilyEssential(name); rather than just return true
in isEssentialFamily.
FilterBase is a base class, and Filter::isFamilyEssential is abstract. I guess returning true is
just the default behavior for most filters.
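
To illustrate the default discussed above, here is a minimal sketch. The classes SketchFilter, SketchFilterBase, SketchSingleFamilyFilter and SketchPassAllFilter are hypothetical stand-ins written for this example, not the real HBase classes; only the method name isFamilyEssential comes from the patch discussion.

```java
import java.util.Arrays;

// Hypothetical stand-in for the Filter contract; not the real HBase class.
abstract class SketchFilter {
    // True if the filter must see this column family to decide on a row.
    abstract boolean isFamilyEssential(byte[] family);
}

// Hypothetical base class: the conservative default treats every family as essential,
// so filters that do not override it keep the old "load everything" behavior.
abstract class SketchFilterBase extends SketchFilter {
    @Override
    boolean isFamilyEssential(byte[] family) {
        return true;
    }
}

// A filter that inspects a single family overrides the default.
class SketchSingleFamilyFilter extends SketchFilterBase {
    private final byte[] testedFamily;

    SketchSingleFamilyFilter(byte[] testedFamily) {
        this.testedFamily = testedFamily.clone();
    }

    @Override
    boolean isFamilyEssential(byte[] family) {
        return Arrays.equals(testedFamily, family);
    }
}

// A filter that simply inherits the default.
class SketchPassAllFilter extends SketchFilterBase {
}
```

With this shape, only filters that know which families they read need to override anything; every other filter stays correct because all families are still loaded for it.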

bq. Why is below in Region and not in RegionScanner?
It is in fact in RegionScanner, together with StoreHeap:
 class RegionScannerImpl implements RegionScanner {
    // Package local for testability
    KeyValueHeap storeHeap = null;
    // Heap of key-values that are not essential for the provided filters and are thus read
    // on demand, if lazy column family loading is enabled.
    KeyValueHeap joinedHeap = null;

This is a little obscene:

+ Collections.sort(results, comparator);

inside HRegion merging results of 'essential' and 'non-essential' data (this probably should
be rephrased...). Can't be avoided though given what is going on here.
That is actually an interesting point. Is there anything that prevents us from sorting
results only at the end, rather than for each row?
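
For reference, a small sketch of what the per-row sort does, next to one possible alternative. This is not what the patch does: the class RowMergeSketch and both method names are hypothetical, and the linear merge assumes that the 'essential' and 'joined' results for a row each arrive already sorted by the same comparator.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper; not part of HBase.
class RowMergeSketch {
    // Mirrors the patch line: append the joined results, then sort the whole row.
    static <T> List<T> appendAndSort(List<T> essential, List<T> joined,
                                     Comparator<? super T> cmp) {
        List<T> results = new ArrayList<>(essential);
        results.addAll(joined);
        Collections.sort(results, cmp);
        return results;
    }

    // Alternative: a linear two-way merge. Only valid if both inputs are already sorted.
    static <T> List<T> mergeSorted(List<T> a, List<T> b, Comparator<? super T> cmp) {
        List<T> out = new ArrayList<>(a.size() + b.size());
        int i = 0;
        int j = 0;
        while (i < a.size() && j < b.size()) {
            // "<=" takes from the first list on ties, matching the stable sort above.
            if (cmp.compare(a.get(i), b.get(j)) <= 0) {
                out.add(a.get(i++));
            } else {
                out.add(b.get(j++));
            }
        }
        while (i < a.size()) {
            out.add(a.get(i++));
        }
        while (j < b.size()) {
            out.add(b.get(j++));
        }
        return out;
    }
}
```

Both methods produce the same ordering for sorted inputs. Whether a merge would actually be cheaper in the scanner depends on the real data layout, which this sketch does not model.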

> Improve performance of scans with some kind of filters.
> -------------------------------------------------------
>                 Key: HBASE-5416
>                 URL: https://issues.apache.org/jira/browse/HBASE-5416
>             Project: HBase
>          Issue Type: Improvement
>          Components: Filters, Performance, regionserver
>    Affects Versions: 0.90.4
>            Reporter: Max Lapan
>            Assignee: Sergey Shelukhin
>             Fix For: 0.96.0
>         Attachments: 5416-Filtered_scans_v6.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch,
Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch,
Filtered_scans_v5.patch, Filtered_scans_v7.patch, HBASE-5416-v10.patch, HBASE-5416-v7-rebased.patch,
HBASE-5416-v8.patch, HBASE-5416-v9.patch
> When a scan is performed, the whole row is loaded into the result list, and after that the filter (if
one exists) is applied to decide whether the row is needed.
> But when a scan is performed on several CFs and the filter checks only data from a subset
of these CFs, data from the CFs not checked by the filter is not needed at the filter stage. It is needed
only once we have decided to include the current row. In such a case we can significantly reduce the amount
of IO performed by a scan by loading only the values actually checked by the filter.
> For example, we have two CFs: flags and snap. Flags is quite small (a bunch of megabytes)
and is used to filter large entries from snap. Snap is very large (10s of GB) and it is quite
costly to scan. If we need only rows with some flag specified, we use SingleColumnValueFilter
to limit the result to a small subset of the region. But the current implementation loads both
CFs to perform the scan, when only a small subset is needed.
> The attached patch adds one routine to the Filter interface to allow a filter to specify which
CFs are needed for its operation. In HRegion, we separate all scanners into two groups: those needed
for the filter and the rest (joined). When a new row is considered, only the needed data is loaded,
the filter is applied, and only if the filter accepts the row is the rest of the data loaded. On our data, this
speeds up such scans 30-50 times. It also gives us a way to better normalize
the data into separate columns by optimizing the scans performed.
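
The two-phase loading described above can be sketched roughly as follows. This is a simplified illustration, not the HRegion code: the class LazyScanSketch, its scan method and the joinedReads counter are hypothetical, each CF is modeled as a plain map from row key to value, and every row is assumed to have an entry in the filtered CF.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical model of lazy column family loading; not the HRegion implementation.
class LazyScanSketch {
    // Counts how many rows were fetched from the large, non-essential CF.
    int joinedReads = 0;

    // essential: row key -> small value checked by the filter (e.g. the flags CF)
    // joined:    row key -> large value loaded only on demand (e.g. the snap CF)
    List<String> scan(Map<String, String> essential, Map<String, String> joined,
                      Predicate<String> filter) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, String> row : essential.entrySet()) {
            // Phase 1: only the essential data is read and filtered.
            if (!filter.test(row.getValue())) {
                continue; // rejected row: the joined CF is never touched
            }
            // Phase 2: the filter accepted the row, so load the rest of it.
            joinedReads++;
            out.add(row.getKey() + "=" + joined.get(row.getKey()));
        }
        return out;
    }
}
```

With three rows of which one passes the filter, joinedReads ends at 1 instead of 3, which is the kind of IO saving the description refers to.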
