hbase-issues mailing list archives

From "Max Lapan (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.
Date Tue, 18 Dec 2012 08:24:15 GMT

    [ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13534750#comment-13534750 ]

Max Lapan commented on HBASE-5416:
----------------------------------

Hi!

The patch is missing one small fix I made this summer (I forgot to post it here, sorry). It is
trivial in code, but a little tricky in logic.

The problem is in SingleColumnValueFilter with filterIfMissing=false (the default). In that case,
the filter must pass rows in which the filtered column is not present. But this performance optimisation
has no way to detect such rows, because we first scan only the CFs referenced by the filters. So it can
miss these rows completely in the result. The solution is quite simple: turn off the optimisation when
filterIfMissing is false.

My patch for 0.90.6 is below; could you please apply it?
{code}
commit 66b32a09e59fe12bfab55e819336678114269bb8
Author: Max Lapan <max.lapan@gmail.com>
Date:   Thu Aug 30 17:22:45 2012 +0400

    Disable fast scans when filterIfMissed=false

diff --git a/src/main/java/org/apache/hadoop/hbase/filter/SingleColumnValueFilter.java b/src/main/java/org/apache/hadoop/hbase/filter/SingleColumnValueFilter.java
index 105009e..2983a5f 100644
--- a/src/main/java/org/apache/hadoop/hbase/filter/SingleColumnValueFilter.java
+++ b/src/main/java/org/apache/hadoop/hbase/filter/SingleColumnValueFilter.java
@@ -276,9 +276,14 @@ public class SingleColumnValueFilter extends FilterBase {
 
   /**
    * The only thing this filter need to check row is given column family. So,
-   * it's the only essential column in whole scan.
+   * it's the only essential column in whole scan. If filterIfMissing==false,
+   * all families are essential, because of a possibility to skip valid rows
+   * without data in filtered CF.
    */
   public boolean isFamilyEssential(byte[] name) {
-    return Bytes.equals(name, this.columnFamily);
+    if (!this.filterIfMissing)
+      return true;
+    else
+      return Bytes.equals(name, this.columnFamily);
   }
 }
{code}
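
To make the failure mode concrete, here is a toy model of it. This is a sketch, not HBase code: the class EssentialFamilySketch, the ValueFilter stand-in, the "patched" switch and the sample rows are all hypothetical names invented for illustration. The only assumption carried over from the real scanner is that a row with no data in any essential CF never reaches the filter.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model (not HBase code): a row is a map from family name to one value.
class EssentialFamilySketch {

    // Hypothetical stand-in for SingleColumnValueFilter on a single family.
    static final class ValueFilter {
        final String family;
        final String expected;
        final boolean filterIfMissing;
        final boolean patched; // true = behave like the patch in this comment

        ValueFilter(String family, String expected, boolean filterIfMissing, boolean patched) {
            this.family = family;
            this.expected = expected;
            this.filterIfMissing = filterIfMissing;
            this.patched = patched;
        }

        boolean isFamilyEssential(String name) {
            // Patched behaviour: with filterIfMissing=false every family is essential.
            if (patched && !filterIfMissing) {
                return true;
            }
            return name.equals(family);
        }

        boolean accepts(Map<String, String> row) {
            String value = row.get(family);
            if (value == null) {
                return !filterIfMissing; // a missing column passes unless filterIfMissing is set
            }
            return value.equals(expected);
        }
    }

    // Assumption of the model: a row is only seen if it has data in an essential family.
    static List<String> scan(TreeMap<String, Map<String, String>> table, ValueFilter filter) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Map<String, String>> entry : table.entrySet()) {
            boolean visible = false;
            for (String family : entry.getValue().keySet()) {
                if (filter.isFamilyEssential(family)) {
                    visible = true;
                    break;
                }
            }
            if (visible && filter.accepts(entry.getValue())) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    static Map<String, String> row(String... familyValuePairs) {
        Map<String, String> r = new HashMap<>();
        for (int i = 0; i + 1 < familyValuePairs.length; i += 2) {
            r.put(familyValuePairs[i], familyValuePairs[i + 1]);
        }
        return r;
    }

    static TreeMap<String, Map<String, String>> sampleTable() {
        TreeMap<String, Map<String, String>> table = new TreeMap<>();
        table.put("row1", row("flags", "on", "snap", "blob1"));
        table.put("row2", row("flags", "off", "snap", "blob2"));
        table.put("row3", row("snap", "blob3")); // no data in the filtered CF
        return table;
    }

    public static void main(String[] args) {
        // Unpatched: row3 is never seen, although filterIfMissing=false should pass it.
        System.out.println(scan(sampleTable(), new ValueFilter("flags", "on", false, false)));
        // Patched: all families are essential, so row3 is seen and accepted.
        System.out.println(scan(sampleTable(), new ValueFilter("flags", "on", false, true)));
    }
}
```

In this model the unpatched scan returns only row1, while the patched scan returns row1 and row3. With filterIfMissing=true both variants return row1, which is why the optimisation can stay enabled in that case.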
                
> Improve performance of scans with some kind of filters.
> -------------------------------------------------------
>
>                 Key: HBASE-5416
>                 URL: https://issues.apache.org/jira/browse/HBASE-5416
>             Project: HBase
>          Issue Type: Improvement
>          Components: Filters, Performance, regionserver
>    Affects Versions: 0.90.4
>            Reporter: Max Lapan
>            Assignee: Max Lapan
>             Fix For: 0.96.0
>
>         Attachments: 5416-Filtered_scans_v6.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch,
Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch,
Filtered_scans_v5.patch, Filtered_scans_v7.patch, HBASE-5416-v7-rebased.patch, HBASE-5416-v8.patch,
HBASE-5416-v9.patch
>
>
> When a scan is performed, the whole row is loaded into the result list, and only then is the filter (if any) applied to decide whether the row is needed.
> But when a scan covers several CFs and the filter checks data from only a subset of them, data from the CFs not checked by the filter is not needed at the filter stage; it is needed only once we have decided to include the current row. In that case we can significantly reduce the amount of IO performed by a scan by loading only the values actually checked by the filter.
> For example, we have two CFs: flags and snap. Flags is quite small (a bunch of megabytes) and is used to filter large entries from snap. Snap is very large (tens of GB) and is quite costly to scan. If we need only rows with some flag set, we use SingleColumnValueFilter to limit the result to a small subset of the region. But the current implementation loads both CFs to perform the scan, even though only a small subset is needed.
> The attached patch adds one method to the Filter interface that lets a filter specify which CFs it needs for its operation. In HRegion, we separate all scanners into two groups: those needed for the filter and the rest (joined). When a new row is considered, only the needed data is loaded and the filter is applied; only if the filter accepts the row is the rest of the data loaded. On our data, this speeds up such scans 30-50 times. It also lets us normalize the data better into separate columns, by optimizing the scans performed.
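
The two-phase loading described in the quoted issue can be sketched with a small in-memory model. This is an illustration only, not the HRegion implementation: the class TwoPhaseScanSketch, its fields and methods, and the sample data are hypothetical, and the model assumes every row has an entry in flags. The read counter stands in for IO against the large CF.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model (not HBase code): two column families kept as separate sorted maps.
class TwoPhaseScanSketch {
    final TreeMap<String, String> flags = new TreeMap<>(); // small CF checked by the filter
    final TreeMap<String, String> snap = new TreeMap<>();  // large CF with the payload
    int snapReads = 0;                                      // stands in for IO on the large CF

    // Builds a table where one row in every flaggedEvery has flags == "on".
    static TwoPhaseScanSketch sample(int rows, int flaggedEvery) {
        TwoPhaseScanSketch table = new TwoPhaseScanSketch();
        for (int i = 0; i < rows; i++) {
            String key = String.format("row%04d", i);
            table.flags.put(key, i % flaggedEvery == 0 ? "on" : "off");
            table.snap.put(key, "payload-" + i);
        }
        return table;
    }

    private String readSnap(String key) {
        snapReads++;
        return snap.get(key);
    }

    // Original behaviour: load the whole row first, then apply the filter.
    List<String> scanEager(String wantedFlag) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, String> entry : flags.entrySet()) {
            String payload = readSnap(entry.getKey());
            if (wantedFlag.equals(entry.getValue())) {
                result.add(entry.getKey() + "=" + payload);
            }
        }
        return result;
    }

    // Two-phase behaviour: check the small CF first, load the rest only for accepted rows.
    List<String> scanTwoPhase(String wantedFlag) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, String> entry : flags.entrySet()) {
            if (wantedFlag.equals(entry.getValue())) {
                result.add(entry.getKey() + "=" + readSnap(entry.getKey()));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        TwoPhaseScanSketch eager = sample(1000, 100);
        TwoPhaseScanSketch lazy = sample(1000, 100);
        System.out.println(eager.scanEager("on").equals(lazy.scanTwoPhase("on")));
        System.out.println("eager snap reads: " + eager.snapReads);
        System.out.println("two-phase snap reads: " + lazy.snapReads);
    }
}
```

Both scans return the same rows; only the number of reads against snap differs. The ratio in this toy depends entirely on the made-up data and says nothing about the speed-up measured on a real cluster.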

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
