hbase-user mailing list archives

From Omkar Joshi <Omkar.Jo...@lntinfotech.com>
Subject RE: Speeding up the row count
Date Fri, 19 Apr 2013 07:33:11 GMT

I have a 2-node (VM-based) Hadoop cluster on top of which HBase runs in distributed mode.

I have a table named ORDERS with more than 100,000 rows.

NOTE: Since my cluster is ultra-small, I didn't pre-split the table.

rowkey        : ORDER_ID
column family : ORDER_DETAILS
    columns   : CUSTOMER_ID

The Java client code that simply checks the row count of the table is:

public long getTableCount(String tableName, String columnFamilyName) {
    AggregationClient aggregationClient = new AggregationClient(config);

    // Scan only the first KeyValue of each row; no column data is needed just to count rows.
    Scan scan = new Scan();
    scan.setFilter(new FirstKeyOnlyFilter());

    long rowCount = 0;
    try {
        rowCount = aggregationClient.rowCount(Bytes.toBytes(tableName), null, scan);
        System.out.println("No. of rows in " + tableName + " is " + rowCount);
    } catch (Throwable e) {
        e.printStackTrace();
    }
    return rowCount;
}

It is running for more than 6 minutes now :(

What can I do to speed up the execution to milliseconds (or at least a couple of seconds)?

Omkar Joshi

-----Original Message-----
From: Vedad Kirlic [mailto:kirlich@gmail.com]
Sent: Thursday, April 18, 2013 12:22 AM
To: user@hbase.apache.org
Subject: Re: Speeding up the row count

Hi Omkar,

If you are not interested in occurrences of a specific column (e.g. name,
email ...) and just want the total number of rows regardless of their
content (i.e. columns), you should avoid adding any columns to the Scan. In
that case the coprocessor implementation behind AggregationClient will add a
FirstKeyOnlyFilter to the Scan itself, avoiding the loading of unnecessary
columns, so this should result in some speed-up.
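As a rough illustration only (the wrapper class name below is made up for the
example, and I haven't verified the coprocessor behaviour myself), such a call
could look like:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.coprocessor.AggregationClient;
import org.apache.hadoop.hbase.util.Bytes;

public class BareScanRowCount {
    // No columns added to the Scan and no explicit filter; per the note above,
    // the aggregation coprocessor should apply a FirstKeyOnlyFilter itself
    // when no qualifiers are specified.
    public static long rowCount(String tableName) throws Throwable {
        Configuration config = HBaseConfiguration.create();
        AggregationClient aggregationClient = new AggregationClient(config);
        Scan scan = new Scan();
        return aggregationClient.rowCount(Bytes.toBytes(tableName), null, scan);
    }
}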

This is a similar approach to what the hbase shell 'count' implementation does,
although the reduction in overhead is bigger in that case, since data transfer
from the region server to the client (shell) is minimized, whereas with the
coprocessor the data never leaves the region server, so most of the improvement
should come from avoiding loading unnecessary files. I'm not sure how this will
apply to your particular case, given that the data set per row seems to be
rather small. Also, with AggregationClient you will benefit if/when your table
spans multiple regions. Essentially, the performance of this approach will
'degrade' as your table gets bigger, but only up to the point where it splits,
from which point it should be pretty constant. Having this in mind, and given
your type of data, you might consider pre-splitting your table; a rough sketch
follows.
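For example (only a sketch, not tested; the split keys and class name are made
up and should be chosen to match your actual ORDER_ID distribution):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitOrders {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        HTableDescriptor desc = new HTableDescriptor("ORDERS");
        desc.addFamily(new HColumnDescriptor("ORDER_DETAILS"));

        // Example split points only; pick boundaries that spread your ORDER_IDs evenly.
        byte[][] splitKeys = new byte[][] {
                Bytes.toBytes("025000"),
                Bytes.toBytes("050000"),
                Bytes.toBytes("075000")
        };
        admin.createTable(desc, splitKeys);
        admin.close();
    }
}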

DISCLAIMER: this is mostly theoretical, since I'm not an expert in hbase
internals :), so your best bet is to try it - I'm too lazy to verify the impact
myself ;)

Finally, if your case can tolerate the counters being only eventually
consistent with the actual number of rows, you can, as already suggested, run
the RowCounter map reduce job every once in a while, write the counter(s) back
to hbase, and read those when you need to obtain the number of rows.
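A rough sketch of that pattern (again untested; the COUNTERS table, its layout,
and the counter group name used for the lookup are assumptions for the
example):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.mapreduce.RowCounter;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.mapreduce.Job;

public class PeriodicRowCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();

        // Run the bundled RowCounter job over ORDERS
        // (roughly equivalent to: hbase org.apache.hadoop.hbase.mapreduce.RowCounter ORDERS).
        Job job = RowCounter.createSubmittableJob(conf, new String[] { "ORDERS" });
        if (!job.waitForCompletion(true)) {
            throw new RuntimeException("RowCounter job failed");
        }

        // The counter group name below is an assumption based on how
        // RowCounter's internal ROWS counter is usually reported.
        long rows = job.getCounters()
                .findCounter("org.apache.hadoop.hbase.mapreduce.RowCounter$RowCounterMapper$Counters",
                        "ROWS").getValue();

        // Write the count back to a (hypothetical) COUNTERS table so readers
        // can fetch it cheaply instead of re-counting.
        HTable counters = new HTable(conf, "COUNTERS");
        Put put = new Put(Bytes.toBytes("ORDERS"));
        put.add(Bytes.toBytes("C"), Bytes.toBytes("rowCount"), Bytes.toBytes(rows));
        counters.put(put);
        counters.close();
    }
}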



