hbase-issues mailing list archives

From "Dave Latham (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.
Date Sat, 23 Feb 2013 15:44:12 GMT

    [ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13585148#comment-13585148 ]

Dave Latham commented on HBASE-5416:

> How hard is it to change filter to use FilterBase and replace it first?
The change is very simple for us.  It means we need to wait a bit before deploying the hbase
upgrade until we can upgrade our client apps first, though.  This is what we've decided to
do, so this incompatibility is not going to be a blocker for us, just a slight delay.

> I'd be interested in why you had to implement Filter directly rather than extending FilterBase.
This particular Filter implementation was made as a wrapper around any other Filter as part
of some experiments we were doing for more dynamic Filter classloading a couple years back.
I don't think there was a FilterBase class at the time, or we may have just chosen to make
it a generic Filter (or actually RowFilterInterface back then) to make sure it implements
and wraps every method.

I think leaving the method in FilterBase only for 0.94 would be a good move.  However, it's
a bit tricky since 0.94.5 has already been released.  If the method is dropped from Filter
in 0.94.6 then we're saying 0.94.6 is compatible with everything but 0.94.5.  However, if you
were unfortunate enough to start on 0.94.5 and implement Filter directly then you're going
to break again.  Perhaps that's a rare enough case.
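
The compatibility point above can be illustrated with a small, self-contained sketch. All names in it (RowFilter, RowFilterBase, isFamilyNeeded, PrefixRowFilter, WrappingRowFilter) are hypothetical stand-ins chosen for illustration; this is not the HBase Filter API, only the general Java pattern of an interface that gains a method while a base class supplies a default.

```java
// Hypothetical stand-in for a filter interface that gained a method in a point release.
interface RowFilter {
    boolean accept(String rowKey);          // present in the older release
    boolean isFamilyNeeded(String family);  // added later; direct implementers must now supply it
}

// Hypothetical stand-in for a base class: it absorbs the new method with a safe default.
abstract class RowFilterBase implements RowFilter {
    @Override
    public boolean isFamilyNeeded(String family) {
        return true; // conservative default: treat every family as needed
    }
}

// Extends the base class, so it compiles unchanged before and after the method was added.
final class PrefixRowFilter extends RowFilterBase {
    private final String prefix;

    PrefixRowFilter(String prefix) {
        this.prefix = prefix;
    }

    @Override
    public boolean accept(String rowKey) {
        return rowKey.startsWith(prefix);
    }
}

// Implements the interface directly in order to wrap every method of another filter.
// Each method added to RowFilter has to be forwarded here by hand; without the
// isFamilyNeeded override below, this class would no longer compile.
final class WrappingRowFilter implements RowFilter {
    private final RowFilter delegate;

    WrappingRowFilter(RowFilter delegate) {
        this.delegate = delegate;
    }

    @Override
    public boolean accept(String rowKey) {
        return delegate.accept(rowKey);
    }

    @Override
    public boolean isFamilyNeeded(String family) {
        return delegate.isFamilyNeeded(family);
    }
}

final class FilterCompatDemo {
    public static void main(String[] args) {
        RowFilter wrapped = new WrappingRowFilter(new PrefixRowFilter("user-"));
        System.out.println(wrapped.accept("user-42"));      // true
        System.out.println(wrapped.isFamilyNeeded("snap")); // true (default from the base class)
    }
}
```

In this sketch, a direct implementer that was compiled before the method existed would still load, but in plain Java a call to the missing method would fail with AbstractMethodError at run time. That is the kind of breakage a client upgrade has to get ahead of.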

> Improve performance of scans with some kind of filters.
> -------------------------------------------------------
>                 Key: HBASE-5416
>                 URL: https://issues.apache.org/jira/browse/HBASE-5416
>             Project: HBase
>          Issue Type: Improvement
>          Components: Filters, Performance, regionserver
>    Affects Versions: 0.90.4
>            Reporter: Max Lapan
>            Assignee: Sergey Shelukhin
>             Fix For: 0.96.0, 0.94.5
>         Attachments: 5416-0.94-v1.txt, 5416-0.94-v2.txt, 5416-0.94-v3.txt, 5416-drop-new-method-from-filter.txt,
5416-Filtered_scans_v6.patch, 5416-v13.patch, 5416-v14.patch, 5416-v15.patch, 5416-v16.patch,
5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch,
Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch, Filtered_scans_v7.patch,
HBASE-5416-v10.patch, HBASE-5416-v11.patch, HBASE-5416-v12.patch, HBASE-5416-v12.patch, HBASE-5416-v7-rebased.patch,
HBASE-5416-v8.patch, HBASE-5416-v9.patch, org.apache.hadoop.hbase.regionserver.TestHRegion-output.txt
> When a scan is performed, the whole row is loaded into the result list, and only then is the filter (if
one exists) applied to decide whether the row is needed.
> But when a scan covers several CFs and the filter checks data from only a subset
of them, data from the CFs not checked by the filter is not needed at the filter stage. It is needed
only once we have decided to include the current row. In that case we can significantly reduce the amount
of IO performed by a scan by loading only the values actually checked by the filter.
> For example, we have two CFs: flags and snap. Flags is quite small (a bunch of megabytes)
and is used to filter large entries from snap. Snap is very large (tens of GB) and is quite
costly to scan. If we need only rows with some flag set, we use SingleColumnValueFilter
to limit the result to a small subset of the region. But the current implementation loads both
CFs to perform the scan, when only a small subset is needed.
> The attached patch adds one routine to the Filter interface to allow a filter to specify which
CFs are needed for its operation. In HRegion, we separate all scanners into two groups: those needed
for the filter and the rest (joined). When a new row is considered, only the needed data is loaded and
the filter is applied; only if the filter accepts the row is the rest of the data loaded. On our data, this
speeds up such scans 30-50 times. It also gives us a way to better normalize
the data into separate columns by optimizing the scans performed.
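
The two-group scan described above can be sketched in a few lines of self-contained Java. This is only an illustration under stated assumptions, not the patch itself: the types FamilyAwareFilter and TwoPhaseScanner, the method isFamilyNeeded, and the in-memory map standing in for column-family storage are all hypothetical, and rows are driven by the needed families only, which is a simplification.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical filter contract: says which families it inspects, then judges a row.
interface FamilyAwareFilter {
    boolean isFamilyNeeded(String family);
    boolean accept(Map<String, String> neededCells); // family -> value for one row
}

// Hypothetical scanner over an in-memory store: family -> (row key -> value).
final class TwoPhaseScanner {
    private final Map<String, Map<String, String>> store;
    private int joinedReads = 0; // counts reads from families the filter does not need

    TwoPhaseScanner(Map<String, Map<String, String>> store) {
        this.store = store;
    }

    int joinedReads() {
        return joinedReads;
    }

    List<Map<String, String>> scan(FamilyAwareFilter filter) {
        // Split the families into the two groups: needed by the filter, and the rest (joined).
        List<String> needed = new ArrayList<>();
        List<String> joined = new ArrayList<>();
        for (String family : store.keySet()) {
            (filter.isFamilyNeeded(family) ? needed : joined).add(family);
        }
        // Simplification: candidate rows come from the needed families only.
        TreeSet<String> rowKeys = new TreeSet<>();
        for (String family : needed) {
            rowKeys.addAll(store.get(family).keySet());
        }
        List<Map<String, String>> results = new ArrayList<>();
        for (String rowKey : rowKeys) {
            // Phase 1: load only what the filter checks.
            Map<String, String> row = new TreeMap<>();
            for (String family : needed) {
                String value = store.get(family).get(rowKey);
                if (value != null) row.put(family, value);
            }
            if (!filter.accept(row)) continue; // rejected: joined families are never read
            // Phase 2: the row is wanted, so load the remaining families.
            for (String family : joined) {
                joinedReads++;
                String value = store.get(family).get(rowKey);
                if (value != null) row.put(family, value);
            }
            results.add(row);
        }
        return results;
    }
}

final class TwoPhaseScanDemo {
    // Small "flags" family and large "snap" family, as in the example above.
    static TwoPhaseScanner sampleScanner() {
        Map<String, String> flags = new TreeMap<>();
        flags.put("r1", "keep");
        flags.put("r2", "skip");
        flags.put("r3", "keep");
        Map<String, String> snap = new TreeMap<>();
        snap.put("r1", "big-1");
        snap.put("r2", "big-2");
        snap.put("r3", "big-3");
        Map<String, Map<String, String>> store = new TreeMap<>();
        store.put("flags", flags);
        store.put("snap", snap);
        return new TwoPhaseScanner(store);
    }

    // A filter that inspects only the "flags" family.
    static FamilyAwareFilter flagEquals(final String wanted) {
        return new FamilyAwareFilter() {
            @Override
            public boolean isFamilyNeeded(String family) {
                return "flags".equals(family);
            }

            @Override
            public boolean accept(Map<String, String> neededCells) {
                return wanted.equals(neededCells.get("flags"));
            }
        };
    }

    public static void main(String[] args) {
        TwoPhaseScanner scanner = sampleScanner();
        System.out.println(scanner.scan(flagEquals("keep")));
        System.out.println("joined reads: " + scanner.joinedReads());
    }
}
```

With the sample data, the row whose flag is "skip" is rejected after phase 1, so the sketch reads the snap family for two rows rather than three. The saving grows with the share of rows the filter rejects.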

