hbase-user mailing list archives

From lars hofhansl <la...@apache.org>
Subject Re: Scan performance
Date Fri, 21 Jun 2013 22:37:51 GMT
HBase is a key-value (KV) store. Each column is stored in its own KV; a row is just a set of
KVs that happen to share the same row key (which is the first part of the key).
I tried to summarize this here: http://hadoop-hbase.blogspot.de/2011/12/introduction-to-hbase.html

In the StoreFiles all KVs are sorted in row/column order, but with 40 columns per row HBase
still needs to skip over many more KVs in order to "reach" the next row. So more disk and
memory IO is needed, even though only one column is returned.
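
To make the skipping cost concrete, here is a toy model in plain Java. It is not HBase code: the KvScanSketch class, the row/qualifier naming, and the strictly sequential walk are all assumptions made for illustration. Real HBase can seek ahead within a StoreFile, so this likely overstates the work, but the trend is the same: the number of KVs passed over grows with the number of columns per row, even when only one qualifier is wanted.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model, not HBase code: every cell is its own key-value (KV),
// and the "store file" keeps KVs sorted by (row key, qualifier).
class KvScanSketch {

    static final class KeyValue {
        final String row;
        final String qualifier;
        final String value;

        KeyValue(String row, String qualifier, String value) {
            this.row = row;
            this.qualifier = qualifier;
            this.value = value;
        }
    }

    // Builds rows * columnsPerRow KVs, emitted in (row, qualifier) order.
    static List<KeyValue> buildStore(int rows, int columnsPerRow) {
        List<KeyValue> store = new ArrayList<>();
        for (int r = 0; r < rows; r++) {
            String rowKey = String.format("row%05d", r);
            for (int c = 0; c < columnsPerRow; c++) {
                store.add(new KeyValue(rowKey, String.format("q%02d", c), "v" + r));
            }
        }
        return store;
    }

    // Walks the store sequentially and returns {KVs examined, KVs matching the qualifier}.
    static long[] scan(List<KeyValue> store, String wantedQualifier) {
        long examined = 0;
        long matched = 0;
        for (KeyValue kv : store) {
            examined++;
            if (kv.qualifier.equals(wantedQualifier)) {
                matched++;
            }
        }
        return new long[] {examined, matched};
    }

    // Convenience wrapper: KVs examined for a table of the given shape.
    static long examinedFor(int rows, int columnsPerRow) {
        return scan(buildStore(rows, columnsPerRow), "q00")[0];
    }

    public static void main(String[] args) {
        System.out.println("1 column per row:   " + examinedFor(6000, 1));
        System.out.println("40 columns per row: " + examinedFor(6000, 40));
    }
}
```

In this model the 40-column table passes over 40 times as many KVs as the 1-column table to find the same cells in column q00.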

If you are using 0.94 there is a new feature, "essential column families". If you always search
by the same column you can place that one in its own column family and all other columns in
another column family. In that case your scan performance should be close to identical.
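
A rough sketch of why the separate column family helps, again as a toy model in plain Java rather than HBase code. The EssentialFamilySketch class, the "hit"/"miss" values, and the choice of row 1234 as the single matching row are all made up for illustration. The model assumes the filter column sits alone in one family and that the other family is read only for rows that pass the filter. In HBase 0.94 I believe the corresponding switch on the client side is Scan.setLoadColumnFamiliesOnDemand(true), but please check that against the API docs for your version.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model, not HBase code: each column family is kept as a separate store.
class EssentialFamilySketch {

    // Everything in one family: all 1 + otherColumns KVs of every row are walked.
    static long examinedSingleFamily(int rows, int otherColumns) {
        return (long) rows * (1 + otherColumns);
    }

    // Filter column in its own family; the other family is read only on a match.
    static long examinedTwoFamilies(int rows, int otherColumns, int matchingRow) {
        List<String> filterFamily = new ArrayList<>();
        List<List<String>> dataFamily = new ArrayList<>();
        for (int r = 0; r < rows; r++) {
            filterFamily.add(r == matchingRow ? "hit" : "miss");
            List<String> cells = new ArrayList<>();
            for (int c = 0; c < otherColumns; c++) {
                cells.add("v" + c);
            }
            dataFamily.add(cells);
        }

        long examined = 0;
        for (int r = 0; r < rows; r++) {
            examined++; // the filter column is read for every row
            if (filterFamily.get(r).equals("hit")) {
                examined += dataFamily.get(r).size(); // rest of the row, only on a match
            }
        }
        return examined;
    }

    public static void main(String[] args) {
        System.out.println("single family: " + examinedSingleFamily(6000, 39));
        System.out.println("two families:  " + examinedTwoFamilies(6000, 39, 1234));
    }
}
```

With 6000 rows and one matching row, the two-family layout in this model touches 6039 KVs instead of 240000, which is close to the cost of the 1-column table.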

-- Lars

From: Tony Dean <Tony.Dean@sas.com>
To: "user@hbase.apache.org" <user@hbase.apache.org> 
Sent: Friday, June 21, 2013 2:08 PM
Subject: Scan performance


I hope that you can shed some light on these 2 scenarios below.

I have 2 small tables of 6000 rows.
Table 1 has only 1 column in each of its rows.
Table 2 has 40 columns in each of its rows.
Other than that the two tables are identical.

In both tables there is only 1 row that contains a matching column that I am filtering on.  
And the Scan performs correctly in both cases by returning only the single result.

The code looks something like the following:

Scan scan = new Scan(startRow, stopRow);   // the start/stop rows should include all 6000
// only return the column that I am interested in (should only be in 1 row and only 1 version)
scan.addColumn(cf, qualifier);

Filter f1 = new InclusiveStopFilter(stopRow);
Filter f2 = new SingleColumnValueFilter(cf, qualifier, CompareFilter.CompareOp.EQUAL, value);
scan.setFilter(new FilterList(f1, f2));

scan.setTimeRange(0, Long.MAX_VALUE);

ResultScanner rs = t.getScanner(scan);
for (Result result : rs) {
    // process the single matching result
}


For table 1, rs.next() takes about 30ms.
For table 2, rs.next() takes about 180ms.

Both are returning the exact same result.  Why is it taking so much longer on table 2 to
get the same result?  The scan depth is the same.  The only difference is the column width. 
But I’m filtering on a single column and returning only that column.

Am I missing something?  As I increase the number of columns, the response time gets worse. 
I do expect the response time to get worse when increasing the number of rows, but not by
increasing the number of columns since I’m returning only 1 column in both cases.

I appreciate any comments that you have.


Tony Dean
SAS Institute Inc.
Principal Software Developer
919-531-6704          …        
