hbase-user mailing list archives

From Tony Dean <Tony.D...@sas.com>
Subject RE: Scan performance
Date Sat, 22 Jun 2013 03:50:02 GMT
I understand more, but have additional questions about the internals...

So, in this example I have 6000 rows X 40 columns in this table.  In this test my startRow
and stopRow do not narrow the scan criteria, therefore all 6000x40 KVs must be included in
the search and thus read from disk into memory.

The first filter that I used was:
Filter f2 = new SingleColumnValueFilter(cf, qualifier, CompareFilter.CompareOp.EQUAL, value);

This means that HBase must look for the qualifier column on all 6000 rows.  As you mention
I could add certain columns to a different cf; but unfortunately, in my case there is no such
small set of columns that will need to be compared (filtered on).  I could try to use indexes
so that a complete row key can be calculated from a secondary index in order to perform a
faster search against data in a primary table.  This requires additional tables and maintenance
that I would like to avoid.

I did try a row key filter with regex hoping that it would limit the number of rows that were
read from disk.
Filter f2 = new RowFilter(CompareFilter.CompareOp.EQUAL, new RegexStringComparator(row_regexpr));

My row keys are something like: vid,sid,event.  sid is not known at query time so I can use
a regex similar to: vid,.*,Logon where Logon is the event that I am looking for in a particular
visit.  In my test data this should have narrowed the scan to 1 row X 40 columns.  The best
I could do for start/stop row is: vid,0 and vid,~ respectively.  I guess that is still going
to cause all 6000 rows to be scanned, but the filtering should be more specific with the rowKey
filter.  However, I did not see any performance improvement.  Anything obvious?

Do you have any other ideas to help out with performance when row key is: vid,sid,event and
sid is not known at query time which leaves a gap in the start/stop row?  Too bad regex can't
be used in start/stop row specification.  That's really what I need.
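
A toy sketch of the point above, in plain Java rather than HBase code: a TreeMap stands in for
the sorted table, and the class, method names and test data are all made up for illustration.
It assumes keys of the form vid,sid,event. The start/stop rows decide how many rows are
examined; the row regex is then evaluated on each of those rows, so it only trims what is
returned, not what is read.

```java
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.regex.Pattern;

// Toy model only: a sorted map of row keys. None of these names are HBase APIs.
class RowKeyRangeSketch {

    // 3 visits x 20 sessions x 4 events, plus one Logon row per visit (243 rows).
    static TreeMap<String, String> buildTable() {
        TreeMap<String, String> table = new TreeMap<>();
        String[] events = {"Click", "Logoff", "Search", "View"};
        for (int v = 1; v <= 3; v++) {
            for (int s = 0; s < 20; s++) {
                String prefix = "v" + v + "," + (1000 + s) + ",";
                for (String event : events) {
                    table.put(prefix + event, "data");
                }
                if (s == 0) {
                    table.put(prefix + "Logon", "data");
                }
            }
        }
        return table;
    }

    // Returns {rows examined inside [startRow, stopRow), rows whose key matches the regex}.
    static int[] scan(TreeMap<String, String> table, String startRow, String stopRow,
                      String rowRegex) {
        Pattern pattern = Pattern.compile(rowRegex);
        NavigableMap<String, String> range = table.subMap(startRow, true, stopRow, false);
        int examined = 0;
        int matched = 0;
        for (String rowKey : range.keySet()) {
            examined++;                                // every row in the range is visited
            if (pattern.matcher(rowKey).matches()) {
                matched++;                             // the regex only decides what is kept
            }
        }
        return new int[] {examined, matched};
    }

    public static void main(String[] args) {
        TreeMap<String, String> table = buildTable();
        int[] full = scan(table, "", "~", "v2,.*,Logon");
        int[] ranged = scan(table, "v2,0", "v2,~", "v2,.*,Logon");
        System.out.println("no range:   examined " + full[0] + ", matched " + full[1]);
        System.out.println("vid range:  examined " + ranged[0] + ", matched " + ranged[1]);
    }
}
```

In this model both scans match the same single row, but only the tighter start/stop range
reduces the number of rows examined.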

Thanks again.

-----Original Message-----
From: Vladimir Rodionov [mailto:vrodionov@carrieriq.com] 
Sent: Friday, June 21, 2013 8:00 PM
To: user@hbase.apache.org; lars hofhansl
Subject: RE: Scan performance

I thought that a column family is the locality group, and that placing columns which are
frequently accessed together into the same column family (locality group) is the obvious
performance improvement tip. What are the "essential column families" for in this context?

As for the original question: unless you place your column into a separate column family in
Table 2, you will need to scan (load from disk if not cached) ~40x more data for the second
table (because you have 40 columns). This may explain why you see such a difference in
execution time if all data needs to be loaded first from HDFS.

Best regards,
Vladimir Rodionov
Principal Platform Engineer
Carrier IQ, www.carrieriq.com
e-mail: vrodionov@carrieriq.com

From: lars hofhansl [larsh@apache.org]
Sent: Friday, June 21, 2013 3:37 PM
To: user@hbase.apache.org
Subject: Re: Scan performance

HBase is a key-value (KV) store. Each column is stored in its own KV; a row is just a set
of KVs that happen to have the same row key (which is the first part of the key).
I tried to summarize this here: http://hadoop-hbase.blogspot.de/2011/12/introduction-to-hbase.html

In the StoreFiles all KVs are sorted in row/column order, but HBase still needs to skip over
many KVs in order to "reach" the next row. So more disk and memory IO is needed.
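
A toy sketch of that layout, in plain Java rather than HBase code: the "store file" is just a
list of {rowKey, qualifier} pairs in row/column order, and all names are hypothetical. It is a
deliberately naive model (the real scanner can seek within the file), so treat the counts as
an illustration of why wider rows mean more KVs to get past, not as a measurement.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: a "store file" is a list of {rowKey, qualifier} pairs
// kept in row/column order. None of these names are HBase APIs.
class KvLayoutSketch {

    static List<String[]> buildStoreFile(int rows, int columnsPerRow) {
        List<String[]> kvs = new ArrayList<>();
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < columnsPerRow; c++) {
                kvs.add(new String[] {
                    String.format("row%05d", r),
                    String.format("col%02d", c)
                });
            }
        }
        return kvs;
    }

    // Returns {KVs walked past, KVs that belong to the wanted qualifier}.
    static int[] scanOneQualifier(List<String[]> kvs, String qualifier) {
        int walked = 0;
        int wanted = 0;
        for (String[] kv : kvs) {
            walked++;                          // every KV lies on the path to the next row
            if (kv[1].equals(qualifier)) {
                wanted++;                      // only one KV per row is actually of interest
            }
        }
        return new int[] {walked, wanted};
    }

    public static void main(String[] args) {
        int[] narrow = scanOneQualifier(buildStoreFile(6000, 1), "col00");
        int[] wide = scanOneQualifier(buildStoreFile(6000, 40), "col00");
        System.out.println("1 column/row:   walked " + narrow[0] + ", wanted " + narrow[1]);
        System.out.println("40 columns/row: walked " + wide[0] + ", wanted " + wide[1]);
    }
}
```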

If you are using 0.94 there is a new feature, "essential column families". If you always
search by the same column you can place that one in its own column family and all other
columns in another column family. In that case your scan performance should be close to
identical.
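
A toy sketch of why that split helps, again in plain Java and not HBase code (all names are
hypothetical). It assumes the filter column sits alone in one family and the other 39 columns
in a second family, and that the second family is read only for rows that pass the filter,
which is my understanding of what the feature does.

```java
import java.util.Map;
import java.util.TreeMap;

// Toy model only: each column family is its own sorted map, mirroring the
// separate store HBase keeps per family. None of these names are HBase APIs.
class EssentialFamilySketch {

    static String rowKey(int r) {
        return String.format("row%05d", r);
    }

    // Small family holding just the filter column (rowKey -> value).
    static TreeMap<String, String> buildFilterFamily(int rows, int matchingRow) {
        TreeMap<String, String> family = new TreeMap<>();
        for (int r = 0; r < rows; r++) {
            family.put(rowKey(r), r == matchingRow ? "match" : "other");
        }
        return family;
    }

    // KVs read when the small family is scanned first and the wide family
    // (columnsInWideFamily KVs per row) is read only for rows that pass.
    static long kvsReadOnDemand(TreeMap<String, String> filterFamily, String wanted,
                                int columnsInWideFamily) {
        long read = 0;
        for (Map.Entry<String, String> e : filterFamily.entrySet()) {
            read++;                                   // one filter KV per row
            if (e.getValue().equals(wanted)) {
                read += columnsInWideFamily;          // wide family touched for matches only
            }
        }
        return read;
    }

    // Everything in one family: every KV of every row is walked.
    static long kvsReadSingleFamily(int rows, int columnsPerRow) {
        return (long) rows * columnsPerRow;
    }

    public static void main(String[] args) {
        TreeMap<String, String> filterFamily = buildFilterFamily(6000, 1234);
        System.out.println("split families: " + kvsReadOnDemand(filterFamily, "match", 39));
        System.out.println("single family:  " + kvsReadSingleFamily(6000, 40));
    }
}
```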

-- Lars

From: Tony Dean <Tony.Dean@sas.com>
To: "user@hbase.apache.org" <user@hbase.apache.org>
Sent: Friday, June 21, 2013 2:08 PM
Subject: Scan performance


I hope that you can shed some light on these 2 scenarios below.

I have 2 small tables of 6000 rows.
Table 1 has only 1 column in each of its rows.
Table 2 has 40 columns in each of its rows.
Other than that the two tables are identical.

In both tables there is only 1 row that contains a matching column that I am filtering on.
And the Scan performs correctly in both cases by returning only the single result.

The code looks something like the following:

Scan scan = new Scan(startRow, stopRow);   // the start/stop rows should include all 6000
scan.addColumn(cf, qualifier);   // only return the column that I am interested in
                                 // (should only be in 1 row and only 1 version)

Filter f1 = new InclusiveStopFilter(stopRow);
Filter f2 = new SingleColumnValueFilter(cf, qualifier, CompareFilter.CompareOp.EQUAL, value);
scan.setFilter(new FilterList(f1, f2));

scan.setTimeRange(0, MAX_LONG);

ResultScanner rs = t.getScanner(scan);
for (Result result : rs) {
    // handle the single matching result
}


For table 1, rs.next() takes about 30ms.
For table 2, rs.next() takes about 180ms.

Both are returning the exact same result.  Why is it taking so much longer on table 2 to get
the same result?  The scan depth is the same.  The only difference is the column width.  But
I'm filtering on a single column and returning only that column.

Am I missing something?  As I increase the number of columns, the response time gets worse.
 I do expect the response time to get worse when increasing the number of rows, but not by
increasing the number of columns since I'm returning only 1 column in both cases.

I appreciate any comments that you have.


Tony Dean
SAS Institute Inc.
Principal Software Developer
919-531-6704          ...

