Message-ID: <17449085.1238541950503.JavaMail.jira@brutus>
Date: Tue, 31 Mar 2009 16:25:50 -0700 (PDT)
From: "Erik Holstad (JIRA)"
To: hbase-dev@hadoop.apache.org
Subject: [jira] Commented: (HBASE-74) [performance] When a get or scan request spans multiple columns, execute the reads in parallel

[ https://issues.apache.org/jira/browse/HBASE-74?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12694320#action_12694320 ]

Erik Holstad commented on 
HBASE-74:
-----------------------------------

When working on HBASE-1249 this thought came to mind, so I tried to design the new system so that it would be pretty easy to add this. There are still a few things that need to be done to make this work properly, and I haven't run any tests to see how much we would gain.

There are a couple of places where the work can be parallelized. If the query is a [Get], we can do things in parallel in multiple places:

HRegionServer: every get can be done in parallel.
HRegion: every family in the get can be done in parallel.
HStore: every read from memCache + storefiles can be done in parallel.

Starting at the bottom, to support parallel computation of the data in HStore you have a bunch of lists that you need to compare:
1. The data list in the sf (storefile)
2. The get list, with the families and columns to look for
3. The result
4. The deletes from previous sfs
5. The deletes from this read
In short: data, get, result, oldDeletes, newDeletes.

With the current layout, where puts and deletes are mixed, you can:
1. Compare the data in the different sfs with the get and create a list of candidates and a list of new deletes for that sf. The compare includes checks for TimeRange, TTL and number of versions.
2. Merge the deletes one by one, starting at the memCache and moving down the sfs. For every merge you send that new delete list into the serverGet it belongs to, then move on to the merge with the next new delete list.
3. When all delete checks are done you are left with your candidate lists from all the sfs; they now need to be merged and checked for number of versions.

So you have:
1. Get candidates and new deletes
2. Merge deletes and check the sgets against the merged deletes
3. Merge candidates

For the parallel case you have a list of sgets with the same data:

// This call can be threaded
1. sget.createCandidates(List data, boolean multiFamily)
2. 
for (int i = 0; i < sgets.size(); i++) { ... }

Doing this can probably increase speed in a lot of cases, but I think it will have the biggest impact on the GetFamilies query, rather than getFull, since for that query you need to look in all the storefiles anyway, which might not be the case for other queries. I don't think it would be too hard to thread the gets from different families, especially now that we don't need to sort the result on the client side but can just append it to the list. Threading multiple gets shouldn't be too hard either.

> [performance] When a get or scan request spans multiple columns, execute the reads in parallel
> ----------------------------------------------------------------------------------------------
>
>          Key: HBASE-74
>          URL: https://issues.apache.org/jira/browse/HBASE-74
>      Project: Hadoop HBase
>   Issue Type: Improvement
>   Components: regionserver
>     Reporter: Jim Kellerman
>     Priority: Critical
>      Fix For: 0.20.0
>
> When a get or scan request spans multiple columns, execute the reads in parallel and use a CountDownLatch to wait for them to complete before returning the results.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
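The per-family parallelism discussed above (each family read threaded, results appended without client-side sorting, and a CountDownLatch used to wait for completion as the issue suggests) could be sketched along these lines. This is only an illustration: the store map, readFamily, and ParallelFamilyGet are made-up names, not real HBase APIs.

```java
import java.util.*;
import java.util.concurrent.*;

// Sketch of HBASE-74's idea: read each column family in parallel and
// wait on a CountDownLatch before returning. All names here are
// hypothetical stand-ins for the real regionserver code paths.
public class ParallelFamilyGet {
    // Pretend store: one list of cells per column family.
    static final Map<String, List<String>> STORE = Map.of(
        "info",  List.of("info:name=foo", "info:age=42"),
        "stats", List.of("stats:hits=7"));

    // Simulated single-family read (in HBase this would go through
    // HStore, hitting the memCache and the storefiles).
    static List<String> readFamily(String family) {
        return STORE.getOrDefault(family, List.of());
    }

    // Read all requested families in parallel. No client-side sort is
    // needed; each worker just appends its cells as it finishes.
    static List<String> get(Collection<String> families) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(families.size());
        CountDownLatch latch = new CountDownLatch(families.size());
        List<String> result = Collections.synchronizedList(new ArrayList<>());
        for (String family : families) {
            pool.execute(() -> {
                try {
                    result.addAll(readFamily(family));
                } finally {
                    latch.countDown();   // signal this family read is done
                }
            });
        }
        latch.await();   // block until every family read has completed
        pool.shutdown();
        return result;
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> out = get(List.of("info", "stats"));
        System.out.println(out.size() + " cells read across both families");
    }
}
```

The same shape would apply one level up (threading whole gets in HRegionServer) or one level down (threading the memCache + storefile reads inside HStore); only the unit of work handed to the pool changes.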