hbase-user mailing list archives

From Marcin Cylke <mcl.hb...@touk.pl>
Subject Re: performance of Get from MR Job
Date Mon, 25 Jun 2012 07:32:59 GMT
On 21/06/12 14:33, Michael Segel wrote:
> I think the version issue is the killer factor here. 
> Usually performing a simple get() where you are getting the latest version of the data
> on the row/cell occurs in some constant time k. This is constant regardless of the size of
> the cluster and should scale in a near linear curve.
> As JD C points out, if your storing temporal data, you should make time part of your

I've rewritten my job so that it no longer sets individual timestamps
on columns, but instead appends the timestamp to the row key. The key
now has this layout:

[previous key][Long.MAX_VALUE-timestamp]
(without the brackets)
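
A minimal sketch of how such a key could be assembled, for illustration only. It assumes the suffix is written as an 8-byte big-endian long (the encoding that HBase's Bytes.toBytes(long) produces) and that the previous key is plain text; the class and method names are hypothetical and not part of HBase or of the job described here.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: builds [previous key][Long.MAX_VALUE - timestamp].
final class ReverseTimestampKey {

    // Assumption: the timestamp is non-negative epoch milliseconds.
    static byte[] build(String previousKey, long timestampMillis) {
        byte[] prefix = previousKey.getBytes(StandardCharsets.UTF_8);
        long reversed = Long.MAX_VALUE - timestampMillis;
        // ByteBuffer writes longs big-endian by default, so for non-negative
        // values the byte order of the suffix matches its numeric order.
        return ByteBuffer.allocate(prefix.length + Long.BYTES)
                .put(prefix)
                .putLong(reversed)
                .array();
    }

    // Unsigned lexicographic comparison, the order in which row keys sort.
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        byte[] older = build("488892772259", 1340000000000L);
        byte[] newer = build("488892772259", 1340600000000L);
        // A negative result means the newer entry sorts before the older one.
        System.out.println(compare(newer, older));
    }
}
```

With this layout the most recent entry for a given prefix sorts first, so a scan that starts at the bare prefix meets the newest row before any older one.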

My keys look like this now:


and I'm issuing a scan like this:

Scan scan = new Scan(Bytes.toBytes("488892772259"));

So I'm searching for my key without the timestamp part appended. What I'm
getting back is all the rows that start with "488892772259".
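
One thing worth checking here: a Scan that is given only a start row has no upper bound, so unless a stop row or a filter is set, or the client stops reading, it keeps going past the last matching row. A minimal sketch of computing an exclusive stop row for a prefix follows. The helper is hypothetical and uses only the JDK; whether the job above already bounds its scans in some other way is not known. The result could be passed as the second argument to HBase's Scan(byte[] startRow, byte[] stopRow) constructor.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper: derives an exclusive stop row for a prefix scan.
final class PrefixStopRow {

    // Returns the smallest key that sorts after every key starting with
    // the given prefix. An empty array means "no upper bound", which is
    // the only option when the prefix is empty or consists of 0xff bytes.
    static byte[] stopRowFor(byte[] prefix) {
        int end = prefix.length;
        // Trailing 0xff bytes cannot be incremented, so drop them.
        while (end > 0 && prefix[end - 1] == (byte) 0xff) {
            end--;
        }
        if (end == 0) {
            return new byte[0];
        }
        byte[] stop = Arrays.copyOf(prefix, end);
        stop[end - 1]++; // increment the last remaining byte
        return stop;
    }

    public static void main(String[] args) {
        byte[] start = "488892772259".getBytes(StandardCharsets.UTF_8);
        byte[] stop = stopRowFor(start);
        System.out.println(new String(stop, StandardCharsets.UTF_8));
    }
}
```

For the prefix "488892772259" the stop row is "48889277225:", because ':' is the byte that follows '9'. A scan bounded this way returns only rows that start with the prefix and ends as soon as the first non-matching row is reached.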

Now the performance is even worse than before (with versioned data).

I'm also observing how large my tables have become, and how compression
affects performance:

My initial data - stored in Hive table - is ~ 1.5GB. When I load it into
HBase it takes ~8GB. Compressing my ColumnFamily with LZO gets the size
down to ~1.5GB, but it also dramatically reduces performance.

To sum up, here are rough times of execution and rates of requests that
I've been observing (for each option I've added GET/SCAN throughput and
rough execution time):

- versioned data (uncompressed table)
    - with misses (asking for non-existent key) - ~400 gets/sec - ~1h
    - with hits (asking for existing keys) - ~150 gets/sec - ~20h
- single version (with complex key)
    - uncompressed - ~30 scans/sec - ~25h
    - compressed with LZO - ~15 scans/sec - ~30h
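
As a rough cross-check of the figures above, multiplying each rate by its run time gives the number of requests each run would imply. This is a sketch under the assumption that the quoted rates are sustained averages over the whole run, which the rough numbers may not be; the class name is hypothetical.

```java
// Hypothetical helper: requests implied by a sustained rate over a run.
final class ImpliedRequests {

    // perSecond: observed requests per second; hours: rough run time.
    static long total(long perSecond, long hours) {
        return perSecond * hours * 3600L;
    }

    public static void main(String[] args) {
        System.out.println("versioned, misses:       " + total(400, 1));
        System.out.println("versioned, hits:         " + total(150, 20));
        System.out.println("single version, plain:   " + total(30, 25));
        System.out.println("single version, LZO:     " + total(15, 30));
    }
}
```

The implied totals range from about 1.4 million to about 10.8 million requests, so either the runs did not issue the same number of requests or the rates varied considerably during each run.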

If necessary, I can provide the complete data, including the
distribution of the number of gets/scans over time.

These performance issues seem very strange to me. Do you have any
suggestions as to what is causing such a big increase in execution time?

