lars hofhansl
Subject Re: Slow full-table scans
Date Sun, 12 Aug 2012
Do you really have to retrieve all 200,000 columns each time?
Scan.setBatch(...) makes no difference?! (note that batching is different and separate from

Also note that the scanner contract is to return sorted KVs, so a single scan cannot be parallelized
across RegionServers (well not entirely true, it could be farmed off in parallel and then
be presented to the client in the right order - but HBase is not doing that). That is why
one vs 12 RSs makes no difference in this scenario.

In the 12 node case you'll see low CPU on all but one RS, and each RS will get its turn.

In your case this is scanning 20,000,000 KVs serially in 400s, that's 50,000 KVs/s, which,
depending on hardware, is not too bad for HBase (but not great either).
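
The figures above can be rechecked from the numbers in the original question quoted below (~100 rows, ~200,000 columns per row, ~10 bytes per cell, ~400s per scan). This is only a rough sketch: the class and method names are made up for illustration, and the byte rate counts cell values alone, ignoring per-KV key overhead.

```java
class ScanThroughputEstimate {
    // Total number of KVs (cells) in the table.
    static long totalCells(long rows, long columnsPerRow) {
        return rows * columnsPerRow;
    }

    // Average rate over the whole scan, using integer division.
    static long perSecond(long amount, long seconds) {
        return amount / seconds;
    }

    public static void main(String[] args) {
        long cells = totalCells(100, 200_000);   // ~100 rows x ~200,000 columns
        long bytes = cells * 10;                 // ~10 bytes per cell value (approximate)
        long scanSeconds = 400;                  // reported scan time

        System.out.println(cells + " KVs");
        System.out.println(perSecond(cells, scanSeconds) + " KVs/s");
        System.out.println(perSecond(bytes, scanSeconds) + " bytes/s");
    }
}
```

At roughly half a megabyte of values per second, the limit is clearly the per-KV cost of the serial scan rather than raw disk or network bandwidth.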

If you only ever expect to run a single query like this on top of your cluster (i.e. your concern
is latency, not throughput), you can do multiple RPCs in parallel, each for a sub-portion of your
key range. Together with batching, you can start using values before everything is streamed back from the
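
A minimal sketch of the parallel sub-range idea, using only the Java standard library. An in-memory sorted map stands in for the table, and scanRange() is a hypothetical placeholder for a real range scan with a start and stop row; the split points are assumptions too. Since the row keys are hashes, evenly spaced split points should divide the work fairly evenly. Collecting the parts in range order keeps the overall result sorted, because the sub-ranges are disjoint and contiguous.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ParallelRangeScanSketch {
    // Stand-in for one range scan. A real client would open a scanner with
    // these start/stop rows; here we just read a slice of a sorted map.
    // A null bound means "open ended" on that side.
    static List<String> scanRange(NavigableMap<String, String> table,
                                  String startRow, String stopRow) {
        NavigableMap<String, String> view = table;
        if (startRow != null) view = view.tailMap(startRow, true);  // start inclusive
        if (stopRow != null) view = view.headMap(stopRow, false);   // stop exclusive
        return new ArrayList<>(view.keySet());
    }

    // Scans the sub-ranges delimited by sorted splitPoints in parallel and
    // concatenates the results in range order.
    static List<String> parallelScan(NavigableMap<String, String> table,
                                     List<String> splitPoints, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<List<String>>> parts = new ArrayList<>();
            for (int i = 0; i <= splitPoints.size(); i++) {
                final String start = (i == 0) ? null : splitPoints.get(i - 1);
                final String stop = (i == splitPoints.size()) ? null : splitPoints.get(i);
                Callable<List<String>> task = () -> scanRange(table, start, stop);
                parts.add(pool.submit(task));
            }
            List<String> all = new ArrayList<>();
            for (Future<List<String>> part : parts) {
                all.addAll(part.get());  // waits for each part, in key order
            }
            return all;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("sub-range scan failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        NavigableMap<String, String> table = new TreeMap<>();
        for (String key : Arrays.asList("c3", "a1", "d4", "b2", "a0")) {
            table.put(key, "value");
        }
        System.out.println(parallelScan(table, Arrays.asList("b", "c"), 3));
    }
}
```

This trades cluster throughput for latency: every sub-range occupies a handler on some RegionServer at the same time, so it only makes sense when few such queries run concurrently.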

-- Lars


Gurjeet Singh

Saturday, August 11, 2012
Subject: Slow full-table scans


I am trying to read all the data out of an HBase table using a scan
and it is extremely slow.

Here are some characteristics of the data:

1. The total table size is tiny (~200MB)
2. The table has ~100 rows and ~200,000 columns in a SINGLE family.
Thus the size of each cell is ~10 bytes and the size of each row is
3. Currently scanning the whole table takes ~400s (both in a
distributed setting with 12 nodes or so and on a single node), thus
4. The row keys are unique 8 byte crypto hashes of sequential numbers
5. The scanner is set to fetch a FULL row at a time (scan.setBatch)
and is set to fetch 100MB of data at a time (scan.setCaching)
6. Changing the caching size seems to have no effect on the total scan
time at all
7. The column family is set up to keep a single version of the cells,
no compression, and no block cache.
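
Items 5 and 6 touch two separate knobs. As far as I understand the client API, Scan.setBatch(n) caps the number of cells returned in one Result, while Scan.setCaching(n) sets how many Results (a count, not a byte size) are shipped per RPC. Under that assumption, the number of round trips for a full scan can be estimated as below. The class and method names are hypothetical, and the extra calls for opening and closing the scanner are ignored.

```java
class ScanRpcEstimate {
    // Results produced for one row when at most `batch` cells go into each Result.
    static long resultsPerRow(long columnsPerRow, long batch) {
        return (columnsPerRow + batch - 1) / batch;  // ceiling division
    }

    // Approximate round trips when each RPC carries up to `caching` Results.
    static long rpcCount(long rows, long columnsPerRow, long batch, long caching) {
        long results = rows * resultsPerRow(columnsPerRow, batch);
        return (results + caching - 1) / caching;    // ceiling division
    }

    public static void main(String[] args) {
        long rows = 100;
        long columns = 200_000;

        // Full row per Result, one Result per RPC.
        System.out.println(rpcCount(rows, columns, 200_000, 1));
        // 1,000 cells per Result, 10 Results per RPC.
        System.out.println(rpcCount(rows, columns, 1_000, 10));
    }
}
```

Smaller batches mean more round trips in total, but each Result arrives sooner and needs far less memory than a full 200,000-column row.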

Am I missing something? Is there a way to optimize this?

I guess a general question I have is whether HBase is a good datastore
for storing many medium-sized (~50GB), dense datasets with lots of
columns when a lot of the queries require full table scans?


