hbase-user mailing list archives

From Jean-Daniel Cryans <jdcry...@apache.org>
Subject Re: performance of Get from MR Job
Date Tue, 26 Jun 2012 18:00:49 GMT
The increase in data size will be due to your bigger row keys, which
are stored along with every value. It's best to keep them on the small
side: http://hbase.apache.org/book.html#keysize

Consider writing the numbers in a binary format instead of storing
them textually, so that a long takes only 8 bytes.
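
For illustration, a minimal sketch of the difference using HBase's
Bytes utility (the column family "cf" and qualifier "q" below are just
placeholders):

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

// A long written as raw bytes is always 8 bytes; the same value written
// as a decimal string costs one byte per digit (12 bytes in this case).
long value = 488892772259L;
byte[] binary  = Bytes.toBytes(value);                  // 8 bytes
byte[] textual = Bytes.toBytes(Long.toString(value));   // 12 bytes

// Store the compact binary form both as the row key and as the value.
Put put = new Put(binary);
put.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), binary);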

As to the extreme slowness of your scans... that is indeed super extra
slow. I can only guess blindly at the reason: maybe you reused the
table that was already there and the old data is still present (even
if deleted). Run a major compaction if that's the case.
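
If stale data turns out to be the issue, a major compaction can be
triggered from the shell with major_compact 'tablename', or
programmatically; a rough sketch with the client API (the table name is
a placeholder):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HBaseAdmin;

// Ask the cluster to major-compact the table so deleted and stale cells
// are actually dropped from the store files. The request is queued and
// the compaction itself runs in the background.
Configuration conf = HBaseConfiguration.create();
HBaseAdmin admin = new HBaseAdmin(conf);
admin.majorCompact("my_table");   // "my_table" is a placeholder
admin.close();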

J-D

On Mon, Jun 25, 2012 at 12:32 AM, Marcin Cylke <mcl.hbase@touk.pl> wrote:
> On 21/06/12 14:33, Michael Segel wrote:
>> I think the version issue is the killer factor here.
>> Usually performing a simple get() where you are getting the latest version of the
>> data on the row/cell occurs in some constant time k. This is constant regardless of the size
>> of the cluster and should scale in a near linear curve.
>>
>> As JD C points out, if you're storing temporal data, you should make time part of your
>> schema.
>
> I've rewritten my job so that it no longer sets individual timestamps
> on the columns, but instead appends the timestamp to the row key. Now
> a key looks like this:
>
> [previous key][Long.MAX_VALUE - timestamp]
> (without the brackets)
>
> My keys look like this now:
>
> 488892772259223372035596613844
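
For illustration, a composite key like the one described could be built
roughly as follows, using the binary form suggested earlier instead of
string concatenation (the helper name is made up):

import org.apache.hadoop.hbase.util.Bytes;

// Append a "reversed" timestamp to the existing key so that, within one
// prefix, the newest rows sort first.
byte[] buildRowKey(byte[] previousKey, long timestamp) {
    return Bytes.add(previousKey, Bytes.toBytes(Long.MAX_VALUE - timestamp));
}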
>
> and I'm issuing a scan like this:
>
> Scan scan = new Scan(Bytes.toBytes("488892772259"));
> scan.setMaxVersions(1);
>
> So I'm searching for my key without the timestamp part appended. What
> I'm getting back is all the rows that start with "488892772259".
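
For comparison, one common way to bound such a prefix scan is to give
it an explicit stop row, so it ends right after the prefix instead of
continuing toward the end of the table (the stop row and caching value
below are only illustrative):

import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

// "48889277225:" is the prefix with its last character incremented
// ('9' -> ':'), an exclusive upper bound for every key starting with
// "488892772259".
Scan scan = new Scan(Bytes.toBytes("488892772259"),
                     Bytes.toBytes("48889277225:"));
scan.setMaxVersions(1);
scan.setCaching(500);   // fetch rows in batches to reduce RPC round trips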
>
> Now the performance is even worse than before (with versioned data).
>
> What I'm also observing is the "hugeness" of my tables and the
> influence of compression on performance:
>
> My initial data - stored in a Hive table - is ~1.5GB. When I load it
> into HBase it takes ~8GB. Compressing my ColumnFamily with LZO gets the
> size down to ~1.5GB, but it also dramatically reduces performance.
>
> To sum up, here are the rough execution times and request rates I've
> been observing (for each option I list the GET/SCAN throughput and the
> rough execution time):
>
> - versioned data (uncompressed table)
>    - with misses (asking for non-existent keys) - ~400 gets/sec - ~1h
>    - with hits (asking for existing keys) - ~150 gets/sec - ~20h
> - single version (with complex key)
>    - uncompressed - ~30 scans/sec - ~25h
>    - compressed with LZO - ~15 scans/sec - ~30h
>
> If necessary, I could provide the complete data, including the time
> distribution of the number of gets/scans.
>
> These performance issues are very strange to me - do you have any
> suggestions as to what's causing such a big increase in execution time?
>
> Regards
> Marcin
