hbase-issues mailing list archives

From "Ted Yu (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.
Date Fri, 14 Dec 2012 04:30:12 GMT

    [ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13532037#comment-13532037 ]

Ted Yu commented on HBASE-5416:
-------------------------------

Based on patch v7, I got the following result on MacBook:
{code}
grep 'scanner finished in' ../testJoinedScanners-output.txt
2012-12-13 20:09:26,809 INFO  [main] regionserver.TestJoinedScanners(172): Slow scanner finished in 112.792634 seconds, got 100 rows
2012-12-13 20:10:15,726 INFO  [main] regionserver.TestJoinedScanners(172): Joined scanner finished in 48.915989 seconds, got 100 rows
2012-12-13 20:10:33,006 INFO  [main] regionserver.TestJoinedScanners(172): Slow scanner finished in 17.280432 seconds, got 100 rows
2012-12-13 20:10:38,514 INFO  [main] regionserver.TestJoinedScanners(172): Joined scanner finished in 5.508207 seconds, got 100 rows
2012-12-13 20:10:51,095 INFO  [main] regionserver.TestJoinedScanners(172): Slow scanner finished in 12.580323 seconds, got 100 rows
2012-12-13 20:11:00,517 INFO  [main] regionserver.TestJoinedScanners(172): Joined scanner finished in 9.422024 seconds, got 100 rows
2012-12-13 20:11:22,650 INFO  [main] regionserver.TestJoinedScanners(172): Slow scanner finished in 22.132854 seconds, got 100 rows
2012-12-13 20:11:31,890 INFO  [main] regionserver.TestJoinedScanners(172): Joined scanner finished in 9.23955 seconds, got 100 rows
2012-12-13 20:11:34,421 INFO  [main] regionserver.TestJoinedScanners(172): Slow scanner finished in 2.531598 seconds, got 100 rows
2012-12-13 20:11:36,694 INFO  [main] regionserver.TestJoinedScanners(172): Joined scanner finished in 2.272578 seconds, got 100 rows
2012-12-13 20:11:39,197 INFO  [main] regionserver.TestJoinedScanners(172): Slow scanner finished in 2.502777 seconds, got 100 rows
2012-12-13 20:11:58,269 INFO  [main] regionserver.TestJoinedScanners(172): Joined scanner finished in 19.071438 seconds, got 100 rows
2012-12-13 20:12:01,043 INFO  [main] regionserver.TestJoinedScanners(172): Slow scanner finished in 2.774262 seconds, got 100 rows
2012-12-13 20:12:03,317 INFO  [main] regionserver.TestJoinedScanners(172): Joined scanner finished in 2.273745 seconds, got 100 rows
2012-12-13 20:12:05,981 INFO  [main] regionserver.TestJoinedScanners(172): Slow scanner finished in 2.664124 seconds, got 100 rows
2012-12-13 20:12:08,574 INFO  [main] regionserver.TestJoinedScanners(172): Joined scanner finished in 2.593234 seconds, got 100 rows
2012-12-13 20:12:11,130 INFO  [main] regionserver.TestJoinedScanners(172): Slow scanner finished in 2.555977 seconds, got 100 rows
2012-12-13 20:12:13,381 INFO  [main] regionserver.TestJoinedScanners(172): Joined scanner finished in 2.250275 seconds, got 100 rows
2012-12-13 20:12:15,721 INFO  [main] regionserver.TestJoinedScanners(172): Slow scanner finished in 2.340003 seconds, got 100 rows
2012-12-13 20:12:18,075 INFO  [main] regionserver.TestJoinedScanners(172): Joined scanner finished in 2.354218 seconds, got 100 rows
{code}
I am running the test on Linux.

Will take another look at the patch and test result tomorrow.
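
For comparing runs like the one above, a small helper can average the "Slow" and "Joined" timings straight from the grep output. This is only a sketch: the class name ScanTimingSummary and the method meanSeconds are made up for illustration and are not part of the patch or of HBase, and the regex assumes the "... scanner finished in N seconds" wording shown in the log.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of HBASE-5416: summarizes TestJoinedScanners output.
class ScanTimingSummary {
    // \s+ between words tolerates lines that were wrapped by the mail client.
    private static final Pattern LINE = Pattern.compile(
            "(Slow|Joined) scanner\\s+finished\\s+in\\s+([0-9.]+) seconds");

    /** Returns {mean slow seconds, mean joined seconds}; an entry is NaN if no lines matched. */
    static double[] meanSeconds(String log) {
        double slowSum = 0, joinedSum = 0;
        int slowCount = 0, joinedCount = 0;
        Matcher m = LINE.matcher(log);
        while (m.find()) {
            double seconds = Double.parseDouble(m.group(2));
            if (m.group(1).equals("Slow")) {
                slowSum += seconds;
                slowCount++;
            } else {
                joinedSum += seconds;
                joinedCount++;
            }
        }
        return new double[] {slowSum / slowCount, joinedSum / joinedCount};
    }

    public static void main(String[] args) {
        // Two lines copied from the output above; pass the full file contents in practice.
        String log = "regionserver.TestJoinedScanners(172): Slow scanner finished in 17.280432 seconds, got 100 rows\n"
                + "regionserver.TestJoinedScanners(172): Joined scanner finished in 5.508207 seconds, got 100 rows\n";
        double[] mean = meanSeconds(log);
        System.out.printf("slow=%.3fs joined=%.3fs ratio=%.2f%n", mean[0], mean[1], mean[0] / mean[1]);
    }
}
```

Averages hide the large run-to-run variance visible in the output (the first iterations are much slower than the later ones), so per-pair ratios may be more telling than a single mean.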
                
> Improve performance of scans with some kind of filters.
> -------------------------------------------------------
>
>                 Key: HBASE-5416
>                 URL: https://issues.apache.org/jira/browse/HBASE-5416
>             Project: HBase
>          Issue Type: Improvement
>          Components: Filters, Performance, regionserver
>    Affects Versions: 0.90.4
>            Reporter: Max Lapan
>            Assignee: Max Lapan
>             Fix For: 0.96.0
>
>         Attachments: 5416-Filtered_scans_v6.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch, Filtered_scans_v7.patch, HBASE-5416-v7-rebased.patch
>
>
> When a scan is performed, the whole row is loaded into the result list, and only then is the filter (if one exists) applied to decide whether the row is needed.
> But when a scan covers several CFs and the filter checks data from only a subset of them, data from the CFs not checked by the filter is not needed at the filter stage; it is needed only once we have decided to include the current row. In that case we can significantly reduce the amount of IO performed by a scan by loading only the values actually checked by the filter.
> For example, we have two CFs: flags and snap. Flags is quite small (a bunch of megabytes) and is used to filter large entries from snap. Snap is very large (tens of GB) and is quite costly to scan. If we need only rows with some flag set, we use SingleColumnValueFilter to limit the result to a small subset of the region. But the current implementation loads both CFs to perform the scan, when only a small subset is needed.
> The attached patch adds one routine to the Filter interface to allow a filter to specify which CF is needed for its operation. In HRegion, we separate all scanners into two groups: those needed for the filter and the rest (joined). When a new row is considered, only the needed data is loaded and the filter is applied; only if the filter accepts the row is the rest of the data loaded. On our data, this speeds up such scans 30-50 times. It also gives us a way to better normalize the data into separate columns by optimizing the scans performed.
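
The two-phase idea in the description can be illustrated with a toy in-memory model. This is a minimal sketch under stated assumptions, not the patch itself: JoinedScanSketch, Family, and scan are hypothetical names, column families are modeled as plain maps, and a read counter stands in for IO. None of it uses the real HRegion or Filter classes.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical illustration of "essential first, joined later"; not HBase code.
class JoinedScanSketch {
    /** In-memory stand-in for one column family: row key -> value. */
    static final class Family {
        final Map<String, String> rows = new LinkedHashMap<>();
        int reads = 0; // counts simulated IO against this family

        String read(String rowKey) {
            reads++;
            return rows.get(rowKey);
        }
    }

    /** Reads the filtered family for every row, the joined family only for accepted rows. */
    static List<String[]> scan(List<String> rowKeys, Family essential, Family joined,
                               Predicate<String> filter) {
        List<String[]> out = new ArrayList<>();
        for (String key : rowKeys) {
            String flag = essential.read(key);        // phase 1: only the data the filter needs
            if (flag == null || !filter.test(flag)) {
                continue;                             // rejected: the large family is never touched
            }
            out.add(new String[] {key, flag, joined.read(key)}); // phase 2: load the rest
        }
        return out;
    }

    public static void main(String[] args) {
        Family flags = new Family();
        Family snap = new Family();
        for (int i = 0; i < 1000; i++) {
            String key = "row" + i;
            flags.rows.put(key, i % 100 == 0 ? "1" : "0"); // small family used for filtering
            snap.rows.put(key, "large-value-" + i);        // large family, costly to read
        }
        List<String[]> result =
                scan(new ArrayList<>(flags.rows.keySet()), flags, snap, "1"::equals);
        System.out.println("rows returned: " + result.size());
        System.out.println("flags reads: " + flags.reads + ", snap reads: " + snap.reads);
    }
}
```

In this model the saving is simply the gap between the two read counters: the large family is read once per accepted row rather than once per scanned row, which is why the benefit grows as the filter becomes more selective.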

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
