hbase-issues mailing list archives

From "Ted Yu (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.
Date Fri, 14 Dec 2012 04:40:17 GMT

    [ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13532040#comment-13532040 ]

Ted Yu commented on HBASE-5416:
-------------------------------

Here are the test results from Linux:
{code}
grep 'scanner finished in' testJoinedScanners-output.txt
2012-12-13 20:28:36,780 INFO  [main] regionserver.TestJoinedScanners(172): Slow scanner finished in 29.421479079 seconds, got 100 rows
2012-12-13 20:28:47,617 INFO  [main] regionserver.TestJoinedScanners(172): Joined scanner finished in 10.836890451 seconds, got 100 rows
2012-12-13 20:28:58,637 INFO  [main] regionserver.TestJoinedScanners(172): Slow scanner finished in 11.019543361 seconds, got 100 rows
2012-12-13 20:29:07,865 INFO  [main] regionserver.TestJoinedScanners(172): Joined scanner finished in 9.227820454 seconds, got 100 rows
2012-12-13 20:29:17,690 INFO  [main] regionserver.TestJoinedScanners(172): Slow scanner finished in 9.824966218 seconds, got 100 rows
2012-12-13 20:29:26,317 INFO  [main] regionserver.TestJoinedScanners(172): Joined scanner finished in 8.626794601 seconds, got 100 rows
2012-12-13 20:29:36,288 INFO  [main] regionserver.TestJoinedScanners(172): Slow scanner finished in 9.97033987 seconds, got 100 rows
2012-12-13 20:29:45,033 INFO  [main] regionserver.TestJoinedScanners(172): Joined scanner finished in 8.745137076 seconds, got 100 rows
2012-12-13 20:29:55,023 INFO  [main] regionserver.TestJoinedScanners(172): Slow scanner finished in 9.989630848 seconds, got 100 rows
2012-12-13 20:30:03,416 INFO  [main] regionserver.TestJoinedScanners(172): Joined scanner finished in 8.392952897 seconds, got 100 rows
2012-12-13 20:30:12,267 INFO  [main] regionserver.TestJoinedScanners(172): Slow scanner finished in 8.850649054 seconds, got 100 rows
2012-12-13 20:30:20,985 INFO  [main] regionserver.TestJoinedScanners(172): Joined scanner finished in 8.718266736 seconds, got 100 rows
2012-12-13 20:30:30,108 INFO  [main] regionserver.TestJoinedScanners(172): Slow scanner finished in 9.122057799 seconds, got 100 rows
2012-12-13 20:30:38,669 INFO  [main] regionserver.TestJoinedScanners(172): Joined scanner finished in 8.561782079 seconds, got 100 rows
2012-12-13 20:30:47,898 INFO  [main] regionserver.TestJoinedScanners(172): Slow scanner finished in 9.228045508 seconds, got 100 rows
2012-12-13 20:30:57,057 INFO  [main] regionserver.TestJoinedScanners(172): Joined scanner finished in 9.158965127 seconds, got 100 rows
2012-12-13 20:31:07,428 INFO  [main] regionserver.TestJoinedScanners(172): Slow scanner finished in 10.370526135 seconds, got 100 rows
2012-12-13 20:31:16,586 INFO  [main] regionserver.TestJoinedScanners(172): Joined scanner finished in 9.157627332 seconds, got 100 rows
2012-12-13 20:31:25,612 INFO  [main] regionserver.TestJoinedScanners(172): Slow scanner finished in 9.026821302 seconds, got 100 rows
2012-12-13 20:31:34,553 INFO  [main] regionserver.TestJoinedScanners(172): Joined scanner finished in 8.93992941 seconds, got 100 rows
{code}
                
> Improve performance of scans with some kind of filters.
> -------------------------------------------------------
>
>                 Key: HBASE-5416
>                 URL: https://issues.apache.org/jira/browse/HBASE-5416
>             Project: HBase
>          Issue Type: Improvement
>          Components: Filters, Performance, regionserver
>    Affects Versions: 0.90.4
>            Reporter: Max Lapan
>            Assignee: Max Lapan
>             Fix For: 0.96.0
>
>         Attachments: 5416-Filtered_scans_v6.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch, Filtered_scans_v7.patch, HBASE-5416-v7-rebased.patch
>
>
> When a scan is performed, the whole row is loaded into the result list, and only then is the filter (if one exists) applied to decide whether the row is needed.
> But when a scan covers several CFs and the filter checks data from only a subset of them, data from the CFs that the filter does not check is not needed at the filter stage; it is needed only once we have decided to include the current row. In that case we can significantly reduce the amount of IO a scan performs by loading only the values the filter actually checks.
> For example, we have two CFs: flags and snap. Flags is quite small (a bunch of megabytes) and is used to filter large entries from snap. Snap is very large (tens of GB) and is quite costly to scan. If we need only rows with some flag set, we use SingleColumnValueFilter to limit the result to a small subset of the region. But the current implementation loads both CFs to perform the scan, when only a small subset is needed.
> The attached patch adds one routine to the Filter interface that lets a filter specify which CFs it needs for its operation. In HRegion, we separate all scanners into two groups: those needed by the filter and the rest (joined). When a new row is considered, only the needed data is loaded and the filter is applied; only if the filter accepts the row is the rest of the data loaded. On our data, this speeds up such scans 30-50 times. It also gives us a way to better normalize the data into separate columns by optimizing the scans performed.
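
The two-group scan described in the issue can be illustrated with a small, self-contained Java sketch. It is not the patch itself and does not use HBase classes: `RowFilter`, `FamilyStore`, `valueEquals`, and `scan` are hypothetical names invented for this illustration, and plain in-memory maps stand in for column-family stores. The sketch only shows the ordering of work: load the families the filter declares it needs, apply the filter, and load the remaining (joined) families only for accepted rows.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: every type here is hypothetical, not HBase's actual API. */
public class JoinedScanSketch {

    /** Hypothetical filter that names the families it reads and then judges a row. */
    interface RowFilter {
        boolean isFamilyEssential(String family);
        boolean accept(Map<String, String> essentialCells);
    }

    /** Hypothetical per-family store; it counts loads so the saved work is visible. */
    static final class FamilyStore {
        final String name;
        final Map<String, String> valuesByRow = new LinkedHashMap<>();
        int loads = 0;

        FamilyStore(String name) { this.name = name; }

        FamilyStore put(String rowKey, String value) {
            valuesByRow.put(rowKey, value);
            return this;
        }

        String load(String rowKey) {
            loads++;
            return valuesByRow.get(rowKey);
        }
    }

    /** Assumed helper: accepts rows whose value in one family equals the expected value. */
    static RowFilter valueEquals(final String family, final String expected) {
        return new RowFilter() {
            @Override public boolean isFamilyEssential(String f) { return family.equals(f); }
            @Override public boolean accept(Map<String, String> cells) {
                return expected.equals(cells.get(family));
            }
        };
    }

    static Map<String, Map<String, String>> scan(
            List<String> rowKeys, List<FamilyStore> stores, RowFilter filter) {
        // Split the stores into those the filter needs and the rest (joined).
        List<FamilyStore> essential = new ArrayList<>();
        List<FamilyStore> joined = new ArrayList<>();
        for (FamilyStore store : stores) {
            if (filter.isFamilyEssential(store.name)) {
                essential.add(store);
            } else {
                joined.add(store);
            }
        }

        Map<String, Map<String, String>> results = new LinkedHashMap<>();
        for (String rowKey : rowKeys) {
            // Load only the data the filter checks.
            Map<String, String> cells = new LinkedHashMap<>();
            for (FamilyStore store : essential) {
                cells.put(store.name, store.load(rowKey));
            }
            if (!filter.accept(cells)) {
                continue; // Rejected: the joined families are never touched.
            }
            // Accepted: now load the rest of the row.
            for (FamilyStore store : joined) {
                cells.put(store.name, store.load(rowKey));
            }
            results.put(rowKey, cells);
        }
        return results;
    }

    public static void main(String[] args) {
        FamilyStore flags = new FamilyStore("flags")
                .put("r1", "1").put("r2", "0").put("r3", "1");
        FamilyStore snap = new FamilyStore("snap")
                .put("r1", "blob-1").put("r2", "blob-2").put("r3", "blob-3");

        Map<String, Map<String, String>> rows = scan(
                Arrays.asList("r1", "r2", "r3"),
                Arrays.asList(flags, snap),
                valueEquals("flags", "1"));

        System.out.println(rows.keySet()
                + " flags loads=" + flags.loads + " snap loads=" + snap.loads);
    }
}
```

In this toy run the small `flags` store is read for all three rows, while the large `snap` store is read only for the two rows that pass the filter. The saving grows with the fraction of rows the filter rejects, which is consistent with the reasoning in the description above, though the sketch says nothing about real IO costs.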

