hbase-issues mailing list archives

From "Anoop Sam John (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.
Date Fri, 26 Oct 2012 08:19:18 GMT

    [ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13484783#comment-13484783
] 

Anoop Sam John commented on HBASE-5416:
---------------------------------------

I got a chance to go through this and the discussion around it.
@Max, clearly it is a good idea. The improvement in your scenario will be huge.
The concerns about the change are worth considering, I guess. It is a very critical path.
I have one idea for solving the problem without the 2-phase RPC.
How about the approach below?
E.g. I have one table with 2 CFs (cf1, cf2) and an SCVF condition on cf1 (cf1:c1=v1).
1. Create a Scan from the client side with only cf1 specified and with the filter

{code}
SingleColumnValueFilter filter = new SingleColumnValueFilter(cf1, c1,
        CompareOp.EQUAL, v1);
Scan scan = new Scan();
scan.setFilter(filter);
scan.addFamily(cf1);
for (Result result : ht.getScanner(scan)){
// deal with result
}
{code}
2. Implement a RegionObserver CP and implement the postScannerNext() hook. This hook executes
within the server.
In the hook, for every rowkey which the scan selects, create a Get request with the remaining
CFs specified and add those KVs to the Result as well.
{code}
public boolean postScannerNext(ObserverContext<RegionCoprocessorEnvironment> e,
      InternalScanner s, List<Result> results, int limit, boolean hasMore) throws IOException
{
    // The next() call happens on one region, from the HRS
    HRegion region = e.getEnvironment().getRegion();
    List<Result> finalResults = new ArrayList<Result>(results.size());
    for (Result result : results) {
      // Every result corresponds to one row. Assumes no batching is being used
      byte[] row = result.getRow();
      Get get = new Get(row);
      get.addFamily(cf2); // cf1 is already fetched
      Result result2 = region.get(get, null);
      List<KeyValue> finalKVs = new ArrayList<KeyValue>();
      finalKVs.addAll(result.list());
      if (!result2.isEmpty()) {
        // The row may have no cells in cf2; list() may return null in that case
        finalKVs.addAll(result2.list());
      }
      finalResults.add(new Result(finalKVs));
    }
    // Replace the results with the new finalResults
    results.clear();
    results.addAll(finalResults);
    return hasMore;
}
{code}
This hook is at the HRS level and runs after the Result objects are prepared. Right now we don't have
any other hook further down the scanner next() call path where we could deal with the
KV list directly, so we need to recreate the Result, which makes for somewhat ugly code.
This way it should be possible to fetch the data you want. It may not be as optimal as the
way with the internal change, but it should still be far better than the 2 RPC calls.
With CPs we can now achieve many things.
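
To make the saving concrete, here is a minimal plain-Java sketch of the same idea with no HBase dependency. It is only a model, not HBase code: Family, scan() and TwoPhaseFetchSketch are hypothetical names, the in-memory maps stand in for the two CFs, and the reads counter stands in for IO against the large CF.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class TwoPhaseFetchSketch {
    // Hypothetical in-memory stand-in for one column family: row key -> value.
    static final class Family {
        final Map<String, String> rows = new TreeMap<>();
        int reads = 0; // counts point lookups, to make the saved IO visible

        String get(String row) {
            reads++;
            return rows.get(row);
        }
    }

    // Phase 1 scans only the small family and applies the equality filter.
    // Phase 2 fetches the large family, but only for rows the filter accepted.
    static Map<String, List<String>> scan(Family small, Family large, String wanted) {
        Map<String, List<String>> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : small.rows.entrySet()) {
            if (!wanted.equals(e.getValue())) {
                continue; // filter rejected the row: the large family is never touched
            }
            List<String> merged = new ArrayList<>();
            merged.add(e.getValue());
            String extra = large.get(e.getKey());
            if (extra != null) { // the row may have no value in the large family
                merged.add(extra);
            }
            out.put(e.getKey(), merged);
        }
        return out;
    }

    public static void main(String[] args) {
        Family flags = new Family();
        Family snap = new Family();
        for (int i = 0; i < 10; i++) {
            String row = "row" + i;
            flags.rows.put(row, i % 5 == 0 ? "v1" : "other");
            snap.rows.put(row, "payload-" + i);
        }
        Map<String, List<String>> result = scan(flags, snap, "v1");
        // prints "[row0, row5] large-family reads: 2"
        System.out.println(result.keySet() + " large-family reads: " + snap.reads);
    }
}
```

In this toy run only 2 of the 10 rows pass the filter, so the large family is read twice instead of ten times. The real gain depends on how selective the filter is and on the relative sizes of the CFs.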
                
> Improve performance of scans with some kind of filters.
> -------------------------------------------------------
>
>                 Key: HBASE-5416
>                 URL: https://issues.apache.org/jira/browse/HBASE-5416
>             Project: HBase
>          Issue Type: Improvement
>          Components: Filters, Performance, regionserver
>    Affects Versions: 0.90.4
>            Reporter: Max Lapan
>            Assignee: Max Lapan
>             Fix For: 0.96.0
>
>         Attachments: 5416-Filtered_scans_v6.patch, 5416-v5.txt, 5416-v6.txt, Filtered_scans.patch,
> Filtered_scans_v2.patch, Filtered_scans_v3.patch, Filtered_scans_v4.patch, Filtered_scans_v5.1.patch,
> Filtered_scans_v5.patch, Filtered_scans_v7.patch
>
>
> When a scan is performed, the whole row is loaded into the result list, and only after that
> is the filter (if one exists) applied to decide whether the row is needed.
> But when a scan is performed on several CFs and the filter checks data from only a subset
> of these CFs, data from the CFs not checked by the filter is not needed at the filter stage.
> It is needed only once we have decided to include the current row. In such a case we can
> significantly reduce the amount of IO performed by a scan by loading only the values actually
> checked by the filter.
> For example, we have two CFs: flags and snap. Flags is quite small (a bunch of megabytes)
> and is used to filter large entries from snap. Snap is very large (10s of GB) and it is quite
> costly to scan. If we need only rows with some flag specified, we use SingleColumnValueFilter
> to limit the result to a small subset of the region. But the current implementation loads both
> CFs to perform the scan, when only a small subset is needed.
> The attached patch adds one routine to the Filter interface to allow a filter to specify which
> CF is needed for its operation. In HRegion, we separate all scanners into two groups: those
> needed for the filter and the rest (joined). When a new row is considered, only the needed data
> is loaded, the filter is applied, and only if the filter accepts the row is the rest of the data
> loaded. On our data, this speeds up such scans 30-50 times. It also gives us a way to better
> normalize the data into separate columns by optimizing the scans performed.

