hbase-dev mailing list archives

From "Erik Holstad (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HBASE-74) [performance] When a get or scan request spans multiple columns, execute the reads in parallel
Date Tue, 31 Mar 2009 23:25:50 GMT

    [ https://issues.apache.org/jira/browse/HBASE-74?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12694320#action_12694320 ]

Erik Holstad commented on HBASE-74:

While working on HBASE-1249 this thought came to mind, so I tried to design the new system
so that it would be fairly easy to add this.
There are still a few things that need to be done to make this work properly, and I haven't
run any tests to see how much we would gain.

There are a couple of places where things can be made parallel.
If the starting query is a [Get], the work can be done in parallel at several levels:
HRegionServer: Every get can be done in parallel.
HRegion : Every family in the get can be done in parallel.
HStore : Every read from memCache + storefiles can be done in parallel.

Starting at the bottom, to support parallel computation of data:
HStore :
  So you have a bunch of lists that you need to compare. They are:
  1. The data list in the sf (storefile)
  2. The get list, with families and columns to look for
  3. The result
  4. The deletes from previous sfs
  5. The deletes from this read
  That is: data, get, result, oldDeletes, newDeletes.

  With the current layout, where puts and deletes are mixed, you can:
  1. Compare the data in the different sfs with the get and create a list of
  candidates and a list of new deletes for each sf. The compare includes checks
  for TimeRange, TTL and number of versions.

  2. Merge deletes one by one, starting at memCache and moving down the sfs. For
  every merge you send that new delete list into the serverGet it belongs to and
  move on to the merge with the next new delete list.

  3. When all delete checks are done you are left with your candidate lists from
  all the sfs. They now need to be merged and checked for number of versions.

  So you have:
  1. GetCandidates and new deletes
  2. Merge deletes and check the sgets against the merged deletes
  3. Merge candidates

  For the parallel version you have a list of sgets (serverGets) that all see the same data:
  // This call can be threaded
  1. sget.createCandidates(List<KeyValue> data, boolean multiFamily)

  2. for (int i = 0; i < sgets.length; i++) {
       oldDeletes = mergeDeletes(oldDeletes, sgets[i].getDeletes())
       // Checking an sget against the merged deletes can be threaded
     }

  3. result = mergeCandidates(List<sget>)
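
  A minimal Java sketch of the three steps above, not the HBASE-1249 code. Everything in
  it is an assumption made for illustration: Cell, ServerGet, dropDeleted and get are
  hypothetical names, a delete marker is modelled as "hides puts on the same column with an
  equal or older timestamp", and the TimeRange and TTL checks are left out. Step 1 runs one
  task per source on a thread pool, step 2 merges deletes sequentially from memCache
  downwards, and step 3 merges the surviving candidates and caps the number of versions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative stand-ins only; none of these are HBase classes.
class ParallelStoreGetSketch {

    /** A put or a delete marker for one column at one timestamp. */
    static final class Cell {
        final String column;
        final long ts;
        final boolean delete;
        Cell(String column, long ts, boolean delete) {
            this.column = column; this.ts = ts; this.delete = delete;
        }
        @Override public String toString() { return column + "@" + ts; }
    }

    /** Per-source state, the "sget" of the text. */
    static final class ServerGet {
        final List<Cell> candidates = new ArrayList<>();
        final Map<String, Long> deletes = new HashMap<>(); // column -> newest delete ts

        // Step 1: collect matching puts and delete markers from one source.
        void createCandidates(List<Cell> data, Set<String> wanted) {
            for (Cell c : data) {
                if (!wanted.contains(c.column)) continue;
                if (c.delete) deletes.merge(c.column, c.ts, Long::max);
                else candidates.add(c);
            }
            dropDeleted(deletes);
        }

        // Drop candidates covered by a delete marker (same column, ts <= marker ts).
        void dropDeleted(Map<String, Long> dels) {
            candidates.removeIf(c -> c.ts <= dels.getOrDefault(c.column, Long.MIN_VALUE));
        }
    }

    /** sources must be ordered newest first: memCache, then the storefiles. */
    static List<Cell> get(List<List<Cell>> sources, Set<String> wanted, int maxVersions) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, sources.size()));
        List<Cell> merged = new ArrayList<>();
        try {
            // Step 1 (threaded): one createCandidates task per source.
            List<Future<ServerGet>> futures = new ArrayList<>();
            for (List<Cell> data : sources) {
                futures.add(pool.submit(() -> {
                    ServerGet sget = new ServerGet();
                    sget.createCandidates(data, wanted);
                    return sget;
                }));
            }
            // Step 2 (sequential): check each sget against the deletes seen so far,
            // then fold its own deletes into the running list.
            Map<String, Long> oldDeletes = new HashMap<>();
            for (Future<ServerGet> f : futures) {
                ServerGet sget = f.get();
                sget.dropDeleted(oldDeletes);
                sget.deletes.forEach((col, ts) -> oldDeletes.merge(col, ts, Long::max));
                merged.addAll(sget.candidates);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while reading", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("read failed", e.getCause());
        } finally {
            pool.shutdown();
        }
        // Step 3: merge candidates by column, newest first, keeping maxVersions each.
        merged.sort((a, b) -> {
            int byColumn = a.column.compareTo(b.column);
            return byColumn != 0 ? byColumn : Long.compare(b.ts, a.ts);
        });
        List<Cell> result = new ArrayList<>();
        String current = null;
        int kept = 0;
        for (Cell c : merged) {
            if (!c.column.equals(current)) { current = c.column; kept = 0; }
            if (kept++ < maxVersions) result.add(c);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Cell> memCache = Arrays.asList(new Cell("a", 5, true), new Cell("b", 7, false));
        List<Cell> sf1 = Arrays.asList(new Cell("a", 4, false), new Cell("b", 3, false));
        Set<String> wanted = new HashSet<>(Arrays.asList("a", "b"));
        System.out.println(get(Arrays.asList(memCache, sf1), wanted, 2));
    }
}
```

  In this sketch only step 1 is actually threaded. The per-sget check in step 2 is
  independent of the running merge once the delete list for that sget is fixed, so it could
  be handed to the same pool; it is kept inline here to keep the ordering obvious.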

Doing this can probably increase speed in a lot of cases, but I think that it will have the
biggest impact on the GetFamilies query (formerly getFull), since for that query you need to
look in all the storefiles anyway, which might not be the case for other queries.

I don't think that it would be too hard to thread the gets from different families, especially
now that we don't need to sort the result on the client side but can just append it to the list.
Threading multiple gets shouldn't be too hard either.
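
As a rough illustration of threading the per-family reads, here is a hedged Java sketch that
uses a CountDownLatch in the way the issue description suggests. The class, the getFamilies
method and the readFamily callback are hypothetical stand-ins for the HRegion/HStore calls, not
existing HBase APIs. Each family is read on its own thread, and the per-family results are then
appended in request order without any extra sort.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicReference;
import java.util.concurrent.atomic.AtomicReferenceArray;
import java.util.function.Function;

// Illustrative only; readFamily stands in for the store-level read of one family.
class ParallelFamilyGetSketch {

    static List<String> getFamilies(List<String> families,
                                    Function<String, List<String>> readFamily) {
        int n = families.size();
        AtomicReferenceArray<List<String>> slots = new AtomicReferenceArray<>(n);
        AtomicReference<RuntimeException> failure = new AtomicReference<>();
        CountDownLatch done = new CountDownLatch(n);
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, n));
        try {
            for (int i = 0; i < n; i++) {
                final int slot = i;
                pool.execute(() -> {
                    try {
                        slots.set(slot, readFamily.apply(families.get(slot)));
                    } catch (RuntimeException e) {
                        failure.compareAndSet(null, e); // remember the first failure
                    } finally {
                        done.countDown(); // always count down so await() cannot hang
                    }
                });
            }
            done.await(); // wait for every family read to finish
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while reading families", e);
        } finally {
            pool.shutdown();
        }
        if (failure.get() != null) {
            throw failure.get();
        }
        // No client-side sort: just append each family's result in request order.
        List<String> result = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            result.addAll(slots.get(i));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> out = getFamilies(Arrays.asList("info", "data"),
                family -> Arrays.asList(family + ":c1", family + ":c2"));
        System.out.println(out);
    }
}
```

The countDown() call sits in a finally block so that a failing read cannot leave the caller
blocked in await(). A real region server would presumably share one pool across requests
rather than create one per get.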

> [performance] When a get or scan request spans multiple columns, execute the reads in parallel
> ----------------------------------------------------------------------------------------------
>                 Key: HBASE-74
>                 URL: https://issues.apache.org/jira/browse/HBASE-74
>             Project: Hadoop HBase
>          Issue Type: Improvement
>          Components: regionserver
>            Reporter: Jim Kellerman
>            Priority: Critical
>             Fix For: 0.20.0
> When a get or scan request spans multiple columns, execute the reads in parallel and
> use a CountDownLatch to wait for them to complete before returning the results.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
