hbase-dev mailing list archives

From "Erik Holstad (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HBASE-1249) Rearchitecting of server, client, API, key format, etc for 0.20
Date Wed, 29 Apr 2009 22:12:30 GMT

    [ https://issues.apache.org/jira/browse/HBASE-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12704373#action_12704373 ]

Erik Holstad commented on HBASE-1249:

With the old code, all the calls for multiple columns, or for all columns of a
family, went into the same get call, getFull, even though the two are doing
very different things. So what I would like to see on the server side are
2 different pieces of code that handle the 2 queries differently.

Let's call these calls getFamilies and getColumns for simplicity, where
getFamilies is the call you make when you want to get all the columns for a
family and getColumns when you specify some columns that you want to get.

In the new code, for the getColumns call I remove the elements from the list
that already have the number of versions asked for, so the getList becomes
smaller and smaller.
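
The shrinking getList can be illustrated with a minimal sketch. This is not the
code from the patch: the class GetColumnsSketch, the type Wanted and the method
scanStore are hypothetical names, and a store (memcache or a storefile) is
reduced to a plain list of qualifiers.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

final class GetColumnsSketch {
    /** Hypothetical entry: one requested column and the versions still wanted. */
    static final class Wanted {
        final String qualifier;
        int remaining;

        Wanted(String qualifier, int remaining) {
            this.qualifier = qualifier;
            this.remaining = remaining;
        }
    }

    /**
     * Scans one store, given here as the qualifiers of its entries.
     * Returns the qualifiers that matched a requested column. A column is
     * removed from getList as soon as all its versions have been found,
     * so later stores have fewer columns to compare against.
     */
    static List<String> scanStore(List<Wanted> getList, List<String> storeQualifiers) {
        List<String> results = new ArrayList<>();
        for (String q : storeQualifiers) {
            Iterator<Wanted> it = getList.iterator();
            while (it.hasNext()) {
                Wanted w = it.next();
                if (w.qualifier.equals(q)) {
                    results.add(q);
                    if (--w.remaining == 0) {
                        it.remove(); // fully satisfied: drop from the getList
                    }
                    break;
                }
            }
        }
        return results;
    }
}
```

Once the getList is empty the caller can stop without opening the remaining
storefiles, which is the early out that getColumns allows.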

For getFamilies the situation is different since you start with an empty getList
and build it as you go.

The problem with the getFamilies query is that you have to keep the number of
versions around even after you have gotten them all.

The old way of doing things was to build a map that had an entry for every 
KeyValue fetched that mapped to the number of versions fetched.

The new approach is to keep a newGet list where you insert the new things from
the current storefile and then, before going into the next storefile, merge
these 2 lists together, the same way as you do with deletes.
This is just one small detail in the new implementation, but I thought it was
important to bring it up since it adds some extra complexity compared to the
otherwise pretty simple getColumns code, from which most of the getFamilies
code is taken.
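
A rough sketch of the getFamilies bookkeeping, under stated assumptions: the
names GetFamiliesSketch, Seen, scanStore, merge and getFamily are made up for
illustration, a store is a list of qualifiers sorted ascending, and each store
holds at most one version per qualifier (as in the timing test below). It is
not the code from the patch.

```java
import java.util.ArrayList;
import java.util.List;

final class GetFamiliesSketch {
    /** Hypothetical entry: a column seen so far and how many versions were returned. */
    static final class Seen {
        final String qualifier;
        int versions;

        Seen(String qualifier, int versions) {
            this.qualifier = qualifier;
            this.versions = versions;
        }
    }

    /** Scans one sorted store; columns not yet in getList are collected in newGets. */
    static List<String> scanStore(List<Seen> getList, List<Seen> newGets,
                                  List<String> storeQualifiers, int maxVersions) {
        List<String> results = new ArrayList<>();
        int i = 0;
        for (String q : storeQualifiers) {
            while (i < getList.size() && getList.get(i).qualifier.compareTo(q) < 0) {
                i++;
            }
            if (i < getList.size() && getList.get(i).qualifier.equals(q)) {
                Seen s = getList.get(i);
                // The entry stays in the list even at maxVersions, so that
                // older storefiles know to skip this column.
                if (s.versions < maxVersions) {
                    s.versions++;
                    results.add(q);
                }
            } else {
                newGets.add(new Seen(q, 1));
                results.add(q);
            }
        }
        return results;
    }

    /** Two-pointer merge of two lists that are both sorted by qualifier. */
    static List<Seen> merge(List<Seen> a, List<Seen> b) {
        List<Seen> merged = new ArrayList<>(a.size() + b.size());
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            if (a.get(i).qualifier.compareTo(b.get(j).qualifier) <= 0) {
                merged.add(a.get(i++));
            } else {
                merged.add(b.get(j++));
            }
        }
        while (i < a.size()) merged.add(a.get(i++));
        while (j < b.size()) merged.add(b.get(j++));
        return merged;
    }

    /** Walks the stores newest first, merging newGets into getList between stores. */
    static List<String> getFamily(List<List<String>> stores, int maxVersions) {
        List<Seen> getList = new ArrayList<>();
        List<String> results = new ArrayList<>();
        for (List<String> store : stores) {
            List<Seen> newGets = new ArrayList<>();
            results.addAll(scanStore(getList, newGets, store, maxVersions));
            getList = merge(getList, newGets);
        }
        return results;
    }
}
```

Unlike the getColumns case, the getList here only grows, and every storefile
has to be visited because a column that has not been seen yet may still show
up in an older one.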

Some timing results:
These tests were only done on the server side and don't include any time for
sending data between server and client. I basically created an HRegion in the
test, inserted data into it and queried it with 2 different calls: the old
getFull() and the semi-old getFamilies(). The tests were done by inserting
data at 3 different times with a flush in between, so you end up with data in
memcache and data in 2 storefiles. The region only has one family and the
qualifiers are numbers ranging from 1 to n, so you have 3 versions of every
insert. The get query was to get 3 versions of all the data, so you need to go
through all the storefiles with no way of an early out, which is always the
case when it comes down to a query like this.

n = 10
new timer 483525
old timer 2619048

n = 100
new timer 3237894
old timer 15307717

n = 1000
new timer 22449940
old timer 435134261
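
To make the numbers easier to compare, the old/new ratio for each n can be
computed directly from the figures above. The class and method names are made
up for this example, and the timer units are not stated in this message, so
only the ratios are meaningful here.

```java
import java.util.Locale;

final class TimingRatios {
    /** Ratio of the old timer to the new timer for one run. */
    static double speedup(long oldTimer, long newTimer) {
        return (double) oldTimer / newTimer;
    }

    public static void main(String[] args) {
        // Each row: n, new timer, old timer (values copied from the results above).
        long[][] runs = {
            {10, 483525L, 2619048L},
            {100, 3237894L, 15307717L},
            {1000, 22449940L, 435134261L},
        };
        for (long[] run : runs) {
            System.out.printf(Locale.ROOT, "n = %d: old/new = %.1f%n",
                    run[0], speedup(run[2], run[1]));
        }
    }
}
```

This works out to roughly 5.4, 4.7 and 19.4 for n = 10, 100 and 1000.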

These tests don't include any deletes. In the case of deletes in the system I
think that the difference will be even greater.

To be clear, the areas in the new code where extra testing and time need to be
spent are the following:
merging of deletes, merging of Gets, and handling of versions in the case of a
"GetFamilies" call.

I did find some bugs while running this timing test, so I will keep working
towards a good patch where most of that stuff is taken care of.

> Rearchitecting of server, client, API, key format, etc for 0.20
> ---------------------------------------------------------------
>                 Key: HBASE-1249
>                 URL: https://issues.apache.org/jira/browse/HBASE-1249
>             Project: Hadoop HBase
>          Issue Type: Improvement
>            Reporter: Jonathan Gray
>            Priority: Blocker
>             Fix For: 0.20.0
>         Attachments: HBASE-1249-Example-v1.pdf, HBASE-1249-Example-v2.pdf, HBASE-1249-GetQuery-v1.pdf,
HBASE-1249-GetQuery-v2.pdf, HBASE-1249-GetQuery-v3.pdf, HBASE-1249-GetQuery-v4.pdf, HBASE-1249-StoreFile-v1.pdf,
> To discuss all the new and potential issues coming out of the change in key format (HBASE-1234):
zero-copy reads, client binary protocol, update of API (HBASE-880), server optimizations,
