Date: Thu, 27 Sep 2012 08:59:07 +1100 (NCT)
From: "Bill Graham (JIRA)"
To: pig-dev@hadoop.apache.org
Reply-To: dev@pig.apache.org
Message-ID: <285652469.130892.1348696747990.JavaMail.jiratomcat@arcas>
Subject: [jira] [Created] (PIG-2934) HBaseStorage filter optimizations

Bill Graham created PIG-2934:
--------------------------------

             Summary: HBaseStorage filter optimizations
                 Key: PIG-2934
                 URL: https://issues.apache.org/jira/browse/PIG-2934
             Project: Pig
          Issue Type: Improvement
            Reporter: Bill Graham
            Assignee: Bill Graham

Our HBase pal/guru Gary Helmling was kind enough to do a code review of HBaseStorage. He suggested some good filter optimizations:

* when using the "lt*" and "gt*" options, set the start/stop rows on the Scan instance, at least in addition to the RowFilters. Without this you're doing a full table scan, regardless of the RowFilters.
* when selecting specific columns or entire families to return, it would be more efficient to set the family + columns on the Scan object (addFamily(), addColumn()), instead of using a FilterList. I'm not familiar with the family:prefix handling you mention, but that would still seem to require filters. But if that's not being used, it would be better to avoid the FilterList for columns.

At minimum, we should probably call Scan.addFamily() with the distinct families, so we can skip entire column families that are not being used. In the case of a table with 4 CFs, if, say, only 1 is being used, this could be a big gain.
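
A minimal sketch of the two suggestions above, written in plain Java so it runs without an HBase dependency. The names ScanHints, Op, boundaryRow, isStartRow and distinctFamilies are hypothetical and are not part of HBaseStorage or HBase. The sketch assumes that Scan start rows are inclusive and stop rows are exclusive, and that the caller would hand the computed values to the Scan instance (its start/stop row and addFamily()); it does not cover the family:prefix case, which would still need filters.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper for illustration; not part of HBaseStorage or the HBase API. */
final class ScanHints {

    /** Comparison options, loosely named after the "gt*" and "lt*" options in the issue. */
    enum Op { GT, GTE, LT, LTE }

    /**
     * Returns the row key to use as a scan boundary for the given option.
     * Assumption: start rows are inclusive and stop rows are exclusive, so
     * GTE and LT map directly, while GT and LTE need the next possible key,
     * which in lexicographic byte order is the key followed by a 0x00 byte.
     */
    static byte[] boundaryRow(Op op, byte[] key) {
        switch (op) {
            case GTE: // start row, already inclusive
            case LT:  // stop row, already exclusive
                return key.clone();
            default:  // GT (exclusive start) or LTE (inclusive stop)
                return Arrays.copyOf(key, key.length + 1); // pads with 0x00
        }
    }

    /** True if the option bounds the scan from below (start row), false for a stop row. */
    static boolean isStartRow(Op op) {
        return op == Op.GT || op == Op.GTE;
    }

    /** Distinct column families from "family:qualifier" specs, in first-seen order. */
    static Set<String> distinctFamilies(List<String> columnSpecs) {
        Set<String> families = new LinkedHashSet<>();
        for (String spec : columnSpecs) {
            int colon = spec.indexOf(':');
            families.add(colon < 0 ? spec : spec.substring(0, colon));
        }
        return families;
    }

    public static void main(String[] args) {
        byte[] key = "row100".getBytes(StandardCharsets.UTF_8);
        // An exclusive lower bound becomes an inclusive start row one byte longer.
        System.out.println(boundaryRow(Op.GT, key).length - key.length); // prints 1
        // Each distinct family would be passed to Scan.addFamily() by the caller.
        System.out.println(distinctFamilies(Arrays.asList("cf1:a", "cf1:b", "cf2:*", "cf3")));
    }
}
```

With the boundaries set on the Scan, the RowFilters can stay in place as a second check, as the review suggests; the point of the sketch is only that the region servers no longer need to read rows outside the requested range or families that no column refers to.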