Date: Wed, 19 Dec 2012 23:25:15 +0000 (UTC)
From: "Sergey Shelukhin (JIRA)"
To: issues@hbase.apache.org
Subject: [jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.

    [ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13536571#comment-13536571 ]

Sergey Shelukhin commented on HBASE-5416:
-----------------------------------------

[~stack] the only change on the main path that is not conditional on joinedScanners/joinedHeap/etc. being present seems to be the refactoring of the while loop under
{code}
} else if (filterRowKey(currentRow, offset, length)) {
{code}
into populateResult. In this case that does one extra matchingRow check of the storeHeap current row against itself (it pre-checks the current heap KV instead of post-checking, but here the row used for the check was just created from this very heap).
I don't think a test could be that targeted, even though refactoring always has the potential to add bugs...

> Improve performance of scans with some kind of filters.
> -------------------------------------------------------
>
>                 Key: HBASE-5416
>                 URL: https://issues.apache.org/jira/browse/HBASE-5416
>             Project: HBase
>          Issue Type: Improvement
>          Components: Filters, Performance, regionserver
>    Affects Versions: 0.90.4
>            Reporter: Max Lapan
>            Assignee: Sergey Shelukhin
>             Fix For: 0.96.0
>
>         Attachments: 5416-Filtered_scans_v6.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch, Filtered_scans_v7.patch, HBASE-5416-v10.patch, HBASE-5416-v11.patch, HBASE-5416-v7-rebased.patch, HBASE-5416-v8.patch, HBASE-5416-v9.patch
>
>
> When a scan is performed, the whole row is loaded into the result list, and only then is the filter (if any) applied to decide whether the row is needed.
> But when a scan covers several CFs and the filter checks only data from a subset of them, data from the CFs the filter does not check is not needed at the filter stage; it is needed only once we have decided to include the current row. In that case we can significantly reduce the IO performed by a scan by loading only the values the filter actually checks.
> For example, we have two CFs: flags and snap. Flags is quite small (a bunch of megabytes) and is used to filter large entries from snap. Snap is very large (tens of GB) and quite costly to scan. If we need only rows with some flag set, we use SingleColumnValueFilter to limit the result to a small subset of the region. But the current implementation loads both CFs to perform the scan, even though only a small subset is needed.
> The attached patch adds one routine to the Filter interface that lets a filter specify which CFs it needs for its operation. In HRegion, we separate all scanners into two groups: those needed for the filter and the rest (joined).
> When a new row is considered, only the needed data is loaded and the filter is applied; only if the filter accepts the row is the rest of the data loaded. On our data, this speeds up such scans 30-50 times. It also gives us a way to better normalize the data into separate columns by optimizing the scans performed.
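
To illustrate the two-phase idea described in the issue, here is a minimal sketch in plain Java. It is not HBase code and does not reflect the patch itself: the class TwoPhaseScanSketch, the scan method, the joinedReads counter, and the use of in-memory maps to stand in for the "flags" and "snap" column families are all hypothetical names chosen for illustration. The sketch also assumes every row has data in the small family, which a real region scanner cannot assume.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical, self-contained model of the two-phase scan; not HBase code.
class TwoPhaseScanSketch {

    // Counts how many rows were read from the large ("joined") family.
    static int joinedReads = 0;

    // essential: row key -> value in the small family the filter inspects (e.g. "flags").
    // joined:    row key -> value in the large family the filter ignores (e.g. "snap").
    static Map<String, String[]> scan(Map<String, String> essential,
                                      Map<String, String> joined,
                                      Predicate<String> filter) {
        Map<String, String[]> result = new LinkedHashMap<>();
        for (Map.Entry<String, String> row : essential.entrySet()) {
            // Phase 1: load only the data the filter needs and evaluate it.
            if (!filter.test(row.getValue())) {
                continue; // Row rejected: the large family is never touched.
            }
            // Phase 2: the filter accepted the row, so load the rest of it.
            joinedReads++;
            String rest = joined.get(row.getKey());
            result.put(row.getKey(), new String[] {row.getValue(), rest});
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> flags = new LinkedHashMap<>();
        flags.put("row1", "keep");
        flags.put("row2", "skip");
        flags.put("row3", "keep");
        Map<String, String> snap = new LinkedHashMap<>();
        snap.put("row1", "big-1");
        snap.put("row2", "big-2");
        snap.put("row3", "big-3");

        Map<String, String[]> rows = scan(flags, snap, "keep"::equals);
        System.out.println(rows.keySet() + " joined reads: " + joinedReads);
    }
}
```

The counter makes the saving visible: the large family is read once per accepted row rather than once per scanned row, so the benefit grows as the filter becomes more selective.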