hbase-dev mailing list archives

From "Erik Holstad (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HBASE-1249) Rearchitecting of server, client, API, key format, etc for 0.20
Date Wed, 29 Apr 2009 00:38:30 GMT

    [ https://issues.apache.org/jira/browse/HBASE-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12703916#action_12703916 ]

Erik Holstad commented on HBASE-1249:
-------------------------------------

@Stack
+ I totally agree that the naming of the compareTo(KeyValue) method is not the best; it should
probably be something else, like you suggested, hasEnough or shouldAdd or something along those
lines.

+ As for why this new code is faster than the old one, there are a couple of things:
You have your KeyValue iterator with data of length k, your columns asked for, the gets, of
length l, and your deleteStructure with deletes of length m.
1. The way a get is done today, every KeyValue is compared against all the columns asked for,
and for every match you do a contains() against the delete structure. Since the deletes are
stored in a SortedSet, each contains() or insert takes log(m), so you end up doing roughly
k*l comparisons plus l*log(m) for the delete checks. A sketch of this cost shape follows
the list.
2. The way the KeyValue object is used in many places is to call getRowLength and then
getColumnLength right after each other, which means that all the calculations for the lengths
and offsets are done multiple times.
3. When the client does not send a structure that groups families together, you have to look
up the right store for every KeyValue.
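Just to illustrate point 1, here is a minimal sketch of that old cost shape; the class and
method names are made up, the real 0.19 code path is more involved, and KV stands in for the
real KeyValue so the example is self-contained. The log(m) assumes a TreeSet-backed SortedSet:

{code}
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;

// Stand-in for org.apache.hadoop.hbase.KeyValue; only what the sketch needs.
class KV implements Comparable<KV> {
  final String row, column;
  KV(String row, String column) { this.row = row; this.column = column; }
  public int compareTo(KV other) {
    int c = row.compareTo(other.row);
    return c != 0 ? c : column.compareTo(other.column);
  }
}

public class OldGetCostSketch {
  // k = data.size(), l = columns.size(), m = deletes.size().
  // Every KeyValue is checked against every asked-for column (k*l compares),
  // and each match pays an O(log m) contains() on the sorted delete set.
  static List<KV> oldStyleGet(List<KV> data, List<String> columns,
                              SortedSet<KV> deletes) {
    List<KV> result = new ArrayList<KV>();
    for (KV kv : data) {                       // k iterations
      for (String col : columns) {             // up to l compares each
        if (kv.column.equals(col)) {
          if (!deletes.contains(kv)) {         // log(m) per match
            result.add(kv);
          }
          break;
        }
      }
    }
    return result;
  }
}
{code}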

1. The new code doesn't have any complicated data structures, just ArrayList<KeyValue>, and
the only operation done on these is add(KeyValue). Compared to the old way the time complexity
is k+l+m, which is far fewer compares. A sketch of this single pass follows the list.
2. In the new code I have moved all the compare methods into the actual code. This means more
code is duplicated, but I think it is a better approach when we are going for speed. I don't
recalculate any lengths or offsets unless it is absolutely necessary.
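The k+l+m figure in point 1 comes from walking sorted inputs in lockstep, so that every
comparison retires at least one element. The sketch below shows the idea with plain strings
standing in for KeyValues and columns; it is illustrative only, not the patch's actual
classes, and leaves the delete index out for brevity:

{code}
import java.util.ArrayList;
import java.util.List;

public class NewGetCostSketch {
  // data and columns are both assumed sorted; deletes fold in the same
  // way with a third index, giving O(k + l + m) overall instead of O(k*l).
  static List<String> singlePassMatch(List<String> data,      // length k
                                      List<String> columns) { // length l
    List<String> result = new ArrayList<String>(); // only add() is ever called
    int d = 0, c = 0;
    while (d < data.size() && c < columns.size()) {
      int cmp = data.get(d).compareTo(columns.get(c));
      if (cmp == 0) { result.add(data.get(d)); d++; }
      else if (cmp < 0) d++;  // data entry sorts before the asked-for column
      else c++;               // column not present in this data: move on
    }
    return result;
  }
}
{code}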

But the biggest gain in time comes from not having complicated data structures on the server,
but rather keeping it simple. Of course other things become more complicated, like merging the
lists after every storefile, but there is no way around that as I see it. I don't think it
gets much faster than doing a sorted merge, as in the sketch below.
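For illustration only, a plain two-way sorted merge of the kind referred to above, again with
strings standing in for KeyValues; it is linear in the combined size of the two lists:

{code}
import java.util.ArrayList;
import java.util.List;

public class SortedMergeSketch {
  // Merge two sorted lists in O(a.size() + b.size()); ties favor 'a',
  // e.g. the newer storefile's results.
  static List<String> merge(List<String> a, List<String> b) {
    List<String> out = new ArrayList<String>(a.size() + b.size());
    int i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
      if (a.get(i).compareTo(b.get(j)) <= 0) out.add(a.get(i++));
      else out.add(b.get(j++));
    }
    while (i < a.size()) out.add(a.get(i++));  // drain whichever list remains
    while (j < b.size()) out.add(b.get(j++));
    return out;
  }
}
{code}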


> Rearchitecting of server, client, API, key format, etc for 0.20
> ---------------------------------------------------------------
>
>                 Key: HBASE-1249
>                 URL: https://issues.apache.org/jira/browse/HBASE-1249
>             Project: Hadoop HBase
>          Issue Type: Improvement
>            Reporter: Jonathan Gray
>            Priority: Blocker
>             Fix For: 0.20.0
>
>         Attachments: HBASE-1249-Example-v1.pdf, HBASE-1249-Example-v2.pdf, HBASE-1249-GetQuery-v1.pdf,
> HBASE-1249-GetQuery-v2.pdf, HBASE-1249-GetQuery-v3.pdf, HBASE-1249-GetQuery-v4.pdf, HBASE-1249-StoreFile-v1.pdf,
> HBASE-1249-StoreFile-v4.pdf
>
>
> To discuss all the new and potential issues coming out of the change in key format (HBASE-1234):
> zero-copy reads, client binary protocol, update of API (HBASE-880), server optimizations,
> etc...

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

