pig-dev mailing list archives

From "Bill Graham (JIRA)" <j...@apache.org>
Subject [jira] [Created] (PIG-2934) HBaseStorage filter optimizations
Date Wed, 26 Sep 2012 21:59:07 GMT
Bill Graham created PIG-2934:

             Summary: HBaseStorage filter optimizations
                 Key: PIG-2934
                 URL: https://issues.apache.org/jira/browse/PIG-2934
             Project: Pig
          Issue Type: Improvement
            Reporter: Bill Graham
            Assignee: Bill Graham

Our HBase pal/guru Gary Helmling was kind enough to do a code review of HBaseStorage. He suggested
some good filter optimizations:

* when using the "lt*" and "gt*" options, set the start/stop rows on the Scan instance, even
if the RowFilters are kept as well. Without start/stop rows the scan still reads the full table;
the RowFilters only discard rows after they have been read.
* when selecting specific columns or entire families to return, it would be more efficient
to set the family + columns on the Scan object (addFamily(), addColumn()), instead of using
a FilterList. I'm not familiar with the family:prefix handling you mention, but that would
still seem to require filters. But if that's not being used, it would be better to avoid the
FilterList for columns. At minimum, we should probably call Scan.addFamily() with the distinct
families, so we can skip entire column families that are not being used. In the case of a
table with 4 CFs, if, say, only 1 is being used, this could be a big gain.
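
A minimal sketch of the two suggestions above, in plain Java with no HBase dependency so it can
be run on its own. It assumes the "gt*"/"lt*" options map to an inclusive or exclusive row-key
bound, that HBase treats the scan start row as inclusive and the stop row as exclusive, and that
columns are given as "family:qualifier" strings. The class and method names (ScanPlanSketch,
startRow, stopRow, distinctFamilies) are hypothetical and are not taken from HBaseStorage.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: computes the values one would hand to the Scan object.
final class ScanPlanSketch {

    // Smallest row key that sorts strictly after `row` in byte order: row + 0x00.
    static byte[] nextRow(byte[] row) {
        return Arrays.copyOf(row, row.length + 1); // copyOf pads with a trailing zero byte
    }

    // Start row for a "greater than" bound. The scan start row is assumed inclusive,
    // so an exclusive bound (gt) has to move to the next possible key.
    static byte[] startRow(byte[] bound, boolean inclusive) {
        return inclusive ? bound.clone() : nextRow(bound);
    }

    // Stop row for a "less than" bound. The scan stop row is assumed exclusive,
    // so an inclusive bound (lte) has to move to the next possible key.
    static byte[] stopRow(byte[] bound, boolean inclusive) {
        return inclusive ? nextRow(bound) : bound.clone();
    }

    // Distinct column families from "family:qualifier" specs, in first-seen order.
    // A spec without a colon is treated as a whole family.
    static Set<String> distinctFamilies(List<String> columnSpecs) {
        Set<String> families = new LinkedHashSet<>();
        for (String spec : columnSpecs) {
            int colon = spec.indexOf(':');
            families.add(colon < 0 ? spec : spec.substring(0, colon));
        }
        return families;
    }

    public static void main(String[] args) {
        byte[] lower = "row100".getBytes(StandardCharsets.UTF_8);
        byte[] upper = "row200".getBytes(StandardCharsets.UTF_8);

        // Bounds for "gt row100" and "lte row200"; both grow by one 0x00 byte.
        System.out.println(Arrays.toString(startRow(lower, false)));
        System.out.println(Arrays.toString(stopRow(upper, true)));

        // Each distinct family would be passed to Scan.addFamily() once.
        System.out.println(distinctFamilies(Arrays.asList("info:name", "info:age", "stats:")));
    }
}
```

The computed byte arrays are what would be set as the start and stop rows on the Scan, and each
entry of the family set is a candidate for Scan.addFamily(). Whether the existing RowFilters and
FilterList stay in place alongside them is a separate decision that the sketch does not cover.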
