hbase-issues mailing list archives

From "ryan rawson (JIRA)" <j...@apache.org>
Subject [jira] Created: (HBASE-3073) New APIs for Result, faster implementation for some calls
Date Fri, 01 Oct 2010 23:55:32 GMT
New APIs for Result, faster implementation for some calls

                 Key: HBASE-3073
                 URL: https://issues.apache.org/jira/browse/HBASE-3073
             Project: HBase
          Issue Type: Bug
    Affects Versions: 0.89.20100924
            Reporter: ryan rawson
            Assignee: ryan rawson
             Fix For: 0.90.0

Our existing API for Result has received little attention over the last year. In the meantime,
inefficiencies in the existing implementation have come to light, causing problems in benchmarks.
Furthermore, some people find the API both difficult to use and not useful enough
(see HBASE-1937).

I propose the following new APIs:
public List<KeyValue> getColumn(byte [] family, byte [] qualifier);
public KeyValue getColumnLatest(byte [] family, byte [] qualifier);

Both are implemented with a binary search on the underlying kvs array (which is sorted).
I also have new implementations of:
public boolean containsColumn(byte [] family, byte [] qualifier);
public byte [] getValue(byte [] family, byte [] qualifier);

The new implementations of containsColumn() and getValue() run faster in the small case, but
seem to run a bit slower in the big case. That is, if you call getValue() 10 times on a Result,
the new implementation is faster, but if you call getValue() 100 times on the same Result, the
old implementation is faster. My tests indicated about 10% slower on 'getValue' 100x with an
overall 1000x iteration on 1000 different Result objects. Considering that most people use
getValue() to retrieve named columns, and iterate when the qualifier list is unknown, I think
this is a reasonable trade-off.
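
To make the binary-search idea concrete, here is a minimal, self-contained sketch. It is not
the patch attached to this issue. Cell is a hypothetical stand-in for KeyValue, and the helper
names (firstIndexOf, compareColumn, compareBytes, sample) are invented for illustration. It
assumes the cells of one row are sorted by family, then qualifier, then newest timestamp first.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class ColumnSearchSketch {

    /** Hypothetical stand-in for KeyValue: one versioned cell of a single row. */
    public static final class Cell {
        public final byte[] family;
        public final byte[] qualifier;
        public final long timestamp;
        public final byte[] value;

        public Cell(String family, String qualifier, long timestamp, String value) {
            this.family = bytes(family);
            this.qualifier = bytes(qualifier);
            this.timestamp = timestamp;
            this.value = bytes(value);
        }
    }

    public static byte[] bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    /** Unsigned lexicographic comparison of two byte arrays. */
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) {
                return d;
            }
        }
        return a.length - b.length;
    }

    /** Compares a cell's column (family, then qualifier) with the requested column. */
    static int compareColumn(Cell c, byte[] family, byte[] qualifier) {
        int d = compareBytes(c.family, family);
        return d != 0 ? d : compareBytes(c.qualifier, qualifier);
    }

    /** Lower-bound binary search: index of the first cell whose column is >= the target. */
    static int firstIndexOf(Cell[] kvs, byte[] family, byte[] qualifier) {
        int lo = 0;
        int hi = kvs.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (compareColumn(kvs[mid], family, qualifier) < 0) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo;
    }

    /** Newest version of the column, or null if the column is absent. */
    public static Cell getColumnLatest(Cell[] kvs, byte[] family, byte[] qualifier) {
        int i = firstIndexOf(kvs, family, qualifier);
        boolean found = i < kvs.length && compareColumn(kvs[i], family, qualifier) == 0;
        return found ? kvs[i] : null;
    }

    /** All versions of the column, newest first; empty if the column is absent. */
    public static List<Cell> getColumn(Cell[] kvs, byte[] family, byte[] qualifier) {
        List<Cell> out = new ArrayList<>();
        int i = firstIndexOf(kvs, family, qualifier);
        while (i < kvs.length && compareColumn(kvs[i], family, qualifier) == 0) {
            out.add(kvs[i++]);
        }
        return out;
    }

    /** Sample row, already in the assumed sort order. */
    public static Cell[] sample() {
        return new Cell[] {
            new Cell("cf", "a", 5L, "a5"),
            new Cell("cf", "b", 9L, "b9"),
            new Cell("cf", "b", 3L, "b3"),
            new Cell("cf", "c", 7L, "c7"),
            new Cell("cf2", "a", 1L, "x1"),
        };
    }

    public static void main(String[] args) {
        Cell[] kvs = sample();
        Cell latest = getColumnLatest(kvs, bytes("cf"), bytes("b"));
        System.out.println(new String(latest.value, StandardCharsets.UTF_8)); // prints b9
        System.out.println(getColumn(kvs, bytes("cf"), bytes("b")).size());   // prints 2
    }
}
```

Each lookup in this sketch costs O(log n) comparisons and builds no auxiliary structure. That
would be consistent with the trade-off described above (cheap for a handful of calls per Result,
less attractive when the same Result is queried many times), though this sketch was not
benchmarked.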

Along with the new API, I recommend using raw() to get the KeyValue objects for iteration.
This increases the visibility of KeyValue and is also much faster to iterate (4.9 times faster
in my mini benchmark: 100 columns per Result, repeated 1000 times on different Result objects).

Given the major speed boost I recently got by changing YCSB to use the raw() interface, I think
this is a must-have for 0.90.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
