hbase-issues mailing list archives

From "Nicolas Spiegelberg (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.
Date Fri, 24 Feb 2012 16:26:49 GMT

    [ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13215720#comment-13215720 ]

Nicolas Spiegelberg commented on HBASE-5416:

Overall, I agree that this is a useful design pattern.  We use this pattern in our messages
deployment and other production use cases as well.  I'm more concerned about this being in
the critical path.  This is deep in the core logic, which has a lot of complicated usage and
is extremely bug-prone (even after extensive unit tests).

If you don't need atomicity, then you don't get much benefit from solving this in the critical
path.  The change introduces a lot of risk and design decisions that we have to worry about
years later.  It might be some work to understand how to use a batch factor; but don't you
think it would take more work to understand the variety of use cases for scans to ensure that
we don't introduce side effects and make a scalable architectural decision?

At the very least, we should get a scan expert to look at this code before committing.  I'm
not one, but I know this isn't the same as making a business logic change.  I just have one
question about the patch right now:  Should we have a unit test case for ensuring the interop
between this feature and 'limit'?  For example, ensure that joinedHeap is scanned before going
to the next row if the storeHeap results.size() == limit.
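
To make the 'limit' case concrete, here is a minimal sketch of the property such a test could check. It is a toy model in plain Java, not HBase code: LimitInteropSketch, Row, and next(limit) are hypothetical names, and the only assumption taken from the discussion is that a row's filter-checked cells come first and its joined cells second. The case of interest is when the filter-checked cells fill a batch exactly; the next call must still return the same row's joined cells before moving on.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model (hypothetical, not HBase code) of batch-limited scanning over two cell groups. */
public class LimitInteropSketch {
    /** One row: cells from the filter-checked families, then cells from the joined families. */
    static final class Row {
        final String key;
        final List<String> essential;
        final List<String> joined;

        Row(String key, List<String> essential, List<String> joined) {
            this.key = key;
            this.essential = essential;
            this.joined = joined;
        }

        /** All cells of the row in emission order: essential first, joined second. */
        List<String> cells() {
            List<String> all = new ArrayList<>(essential);
            all.addAll(joined);
            return all;
        }
    }

    private final List<Row> rows; // assumed non-empty rows
    private int rowIndex = 0;     // row currently being emitted
    private int cellIndex = 0;    // next cell to emit within that row

    LimitInteropSketch(List<Row> rows) {
        this.rows = rows;
    }

    /** Returns up to 'limit' (assumed positive) cells of the current row; empty when the scan is done. */
    List<String> next(int limit) {
        List<String> batch = new ArrayList<>();
        if (rowIndex >= rows.size()) {
            return batch;
        }
        Row row = rows.get(rowIndex);
        List<String> cells = row.cells();
        while (cellIndex < cells.size() && batch.size() < limit) {
            batch.add(row.key + "/" + cells.get(cellIndex++));
        }
        // Advance to the next row only after the joined cells have been emitted too.
        if (cellIndex >= cells.size()) {
            rowIndex++;
            cellIndex = 0;
        }
        return batch;
    }

    public static void main(String[] args) {
        LimitInteropSketch scanner = new LimitInteropSketch(Arrays.asList(
                new Row("r1", Arrays.asList("a", "b"), Arrays.asList("c")),
                new Row("r2", Arrays.asList("d"), Arrays.asList("e"))));
        List<String> batch;
        while (!(batch = scanner.next(2)).isEmpty()) {
            System.out.println(batch);
        }
    }
}
```

With a limit of 2, the first batch holds exactly the two filter-checked cells of r1, so the second batch is where a bug would show up: it has to contain r1's joined cell rather than anything from r2.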

> Improve performance of scans with some kind of filters.
> -------------------------------------------------------
>                 Key: HBASE-5416
>                 URL: https://issues.apache.org/jira/browse/HBASE-5416
>             Project: HBase
>          Issue Type: Improvement
>          Components: filters, performance, regionserver
>    Affects Versions: 0.90.4
>            Reporter: Max Lapan
>            Assignee: Max Lapan
>         Attachments: 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch,
>                      Filtered_scans_v3.patch, Filtered_scans_v4.patch
>
> When a scan is performed, the whole row is loaded into the result list, and only then is the
> filter (if one exists) applied to decide whether the row is needed.
> But when a scan covers several CFs and the filter checks data from only a subset of them, the
> data from the CFs not checked by the filter is not needed at the filter stage; it is needed
> only once we have decided to include the current row. In that case we can significantly reduce
> the amount of IO performed by a scan by loading only the values actually checked by the filter.
> For example, we have two CFs: flags and snap. Flags is quite small (a bunch of megabytes)
> and is used to filter large entries from snap. Snap is very large (10s of GB) and is quite
> costly to scan. If we need only rows with some flag specified, we use SingleColumnValueFilter
> to limit the result to a small subset of the region. But the current implementation loads both
> CFs to perform the scan, even though only a small subset is needed.
> The attached patch adds one routine to the Filter interface that lets a filter specify which
> CFs are needed for its operation. In HRegion, we separate all scanners into two groups: those
> needed for the filter and the rest (joined). When a new row is considered, only the needed data
> is loaded and the filter is applied; only if the filter accepts the row is the rest of the data
> loaded. On our data, this speeds up such scans 30-50 times. It also gives us a way to better
> normalize the data into separate columns by optimizing the scans performed.
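
The two-group idea in the description can be sketched as follows. This is an illustration only, using in-memory maps in place of column families; TwoPhaseScanSketch, Outcome, and scan are hypothetical names and none of this is the patch's actual code or the HBase API. The small "flags" map is always read, while the large "snap" map is read only for rows the filter accepts, and a counter stands in for the IO that is saved.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

/** Toy model (hypothetical, not HBase code) of filter-first loading across two column families. */
public class TwoPhaseScanSketch {
    /** Accepted rows plus a count of reads against the large family (a stand-in for IO). */
    static final class Outcome {
        final List<String> rows = new ArrayList<>();
        int joinedLoads = 0;
    }

    /**
     * Phase 1 reads only the filter-checked family ("flags") and applies the filter.
     * Phase 2 reads the remaining family ("snap") only for rows that passed phase 1.
     */
    static Outcome scan(Map<String, String> flags, Map<String, String> snap,
                        Predicate<String> flagFilter) {
        Outcome out = new Outcome();
        for (Map.Entry<String, String> entry : flags.entrySet()) {
            if (!flagFilter.test(entry.getValue())) {
                continue; // row rejected: the large family is never touched
            }
            out.joinedLoads++;
            String snapValue = snap.get(entry.getKey());
            out.rows.add(entry.getKey() + " flags=" + entry.getValue() + " snap=" + snapValue);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> flags = new LinkedHashMap<>();
        flags.put("row1", "0");
        flags.put("row2", "1");
        flags.put("row3", "0");
        Map<String, String> snap = new LinkedHashMap<>();
        snap.put("row1", "large-1");
        snap.put("row2", "large-2");
        snap.put("row3", "large-3");

        Outcome out = scan(flags, snap, value -> value.equals("1"));
        System.out.println(out.rows + " joinedLoads=" + out.joinedLoads);
    }
}
```

In this model the number of reads against the large family equals the number of accepted rows rather than the number of rows scanned, which is the effect the description attributes to the patch.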
