hbase-user mailing list archives

From Marcin Cylke <mcl.hb...@touk.pl>
Subject performance of Get from MR Job
Date Tue, 19 Jun 2012 08:37:18 GMT
Hi

I've run into some performance issues with my Hadoop MapReduce job.
Basically, this is what it does:

- it reads data from an HDFS file
- the output also goes to HDFS (multiple files in my scenario)
- in my mapper I process each line and enrich it with some data read
from an HBase table (I issue a Get for every line)
- I don't use a reducer

The Get performance is not that good. On average it is ~17.5
gets/second. Peaks are 100 gets/second (which would be a desirable speed :)).
Both the logs and these figures come from one node only.

My schema is nothing special - one ColumnFamily with 3 columns. But I
use timestamps heavily. My table looks like this:

{NAME => 'XYZ', FAMILIES => [{NAME => 'cf', BLOOMFILTER => 'NONE',
REPLICATION_SCOPE => '0', COMPRESSION => 'NONE', VERSIONS => '2147483646',
TTL => '2147483647', MIN_VERSIONS => '0', BLOCKSIZE => '65536',
IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}

Note the number of VERSIONS.

And my Gets look like this:
Get get = new Get(Bytes.toBytes(key));
get.setMaxVersions(1);
get.setTimeRange(0, timestamp);
get.setCacheBlocks(false);
get.addFamily(Bytes.toBytes("cf"));
Result res = htable.get(get);
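
To make the intent of this Get concrete: with setMaxVersions(1) and a time range of [0, timestamp), it asks for the newest version of each column whose timestamp is strictly below `timestamp` (the upper bound of a time range is exclusive). The stand-alone sketch below only models that selection rule for a single cell; the `VersionLookup` class and the `TreeMap` standing in for the stored versions are hypothetical illustrations, not HBase code, and the values are made up.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for one cell's versions: timestamp -> value.
// This models the selection rule only, it does not talk to HBase.
class VersionLookup {

    /**
     * Returns the value of the newest version whose timestamp is strictly
     * below maxStamp, or null if there is none. Assumes timestamps are
     * non-negative, so the lower bound of 0 never excludes anything.
     */
    static String newestBefore(TreeMap<Long, String> versions, long maxStamp) {
        Map.Entry<Long, String> entry = versions.lowerEntry(maxStamp);
        return entry == null ? null : entry.getValue();
    }

    public static void main(String[] args) {
        // Invented example data.
        TreeMap<Long, String> versions = new TreeMap<>();
        versions.put(100L, "a");
        versions.put(200L, "b");
        versions.put(300L, "c");

        // The version written exactly at 300 is excluded: the bound is exclusive.
        System.out.println(newestBefore(versions, 300L));
    }
}
```

So with almost unlimited VERSIONS per cell, each of my Gets is effectively a "latest version before this point in time" lookup rather than a plain read of the newest cell.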

I initialize the HTable like this:
htable = new HTable(config, QUERY_TABLE_NAME);
htable.setAutoFlush(false);
htable.setWriteBufferSize(1024 * 1024 * 12);


I've attached a sample of the Get performance - the first column is the
number of GETs, the second is the date.
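
An average and a peak can be derived from a file in that layout with something like the sketch below. It rests on assumptions: the class name `GetRateSummary` is hypothetical, the count is taken to be the first whitespace-separated token on each line, each line is taken to cover one second, and the lines in `main` are invented, not the real attachment.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: summarizes lines of the form "<count> <date...>".
class GetRateSummary {

    /** Returns {average, peak} of the leading count over all non-blank lines. */
    static double[] summarize(List<String> lines) {
        long total = 0;
        long peak = 0;
        int buckets = 0;
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank lines
            }
            // Only the first token is parsed; the date part is ignored.
            long count = Long.parseLong(trimmed.split("\\s+")[0]);
            total += count;
            peak = Math.max(peak, count);
            buckets++;
        }
        double average = buckets == 0 ? 0.0 : (double) total / buckets;
        return new double[] { average, peak };
    }

    public static void main(String[] args) {
        // Invented sample lines, NOT the attached data.
        List<String> sample = Arrays.asList(
                "10 2012-06-19 08:00:01",
                "20 2012-06-19 08:00:02",
                "100 2012-06-19 08:00:03");
        double[] summary = summarize(sample);
        System.out.println("avg=" + summary[0] + " peak=" + summary[1]);
    }
}
```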

Could you suggest where that performance penalty comes from? What
should I look at, and which statistics, to check that I'm not doing
something stupid here?

Regards
Marcin
