hbase-user mailing list archives

From lars hofhansl <lhofha...@yahoo.com>
Subject Re: Slow full-table scans
Date Sun, 12 Aug 2012 23:00:04 GMT
You can use HTable.{getStartEndKeys|getEndKeys|getStartKeys} to get the current region demarcations
for your table.
If you want to group threads by RegionServer (which you should), you can get that information
via HTable.getRegionLocation{s}.

-- Lars
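
A minimal sketch of the grouping step described above, using only the Java standard library. It assumes the region boundaries and hosting servers have already been fetched (for example via the HTable calls named in the message) and copied into a small holder class. The RegionRange class, the string keys, and the server names are hypothetical stand-ins, not HBase types.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Groups region key ranges by the server that hosts them (illustrative stand-in types). */
final class RegionGrouping {

    /** Hypothetical holder: one region's [startKey, endKey) range plus its hosting server. */
    static final class RegionRange {
        final String startKey; // empty string stands for "start of table"
        final String endKey;   // empty string stands for "end of table"
        final String server;

        RegionRange(String startKey, String endKey, String server) {
            this.startKey = startKey;
            this.endKey = endKey;
            this.server = server;
        }
    }

    /** Returns server -> regions on that server, preserving the input order. */
    static Map<String, List<RegionRange>> groupByServer(List<RegionRange> regions) {
        Map<String, List<RegionRange>> byServer = new LinkedHashMap<>();
        for (RegionRange r : regions) {
            byServer.computeIfAbsent(r.server, s -> new ArrayList<>()).add(r);
        }
        return byServer;
    }

    public static void main(String[] args) {
        List<RegionRange> regions = new ArrayList<>();
        regions.add(new RegionRange("", "4", "rs1"));
        regions.add(new RegionRange("4", "8", "rs2"));
        regions.add(new RegionRange("8", "c", "rs1"));
        regions.add(new RegionRange("c", "", "rs3"));
        // One worker thread per server could then scan that server's ranges in turn.
        groupByServer(regions).forEach((server, list) ->
            System.out.println(server + " hosts " + list.size() + " region(s)"));
    }
}
```

With one worker per map entry, no RegionServer is asked to serve more than one of the client's scans at a time.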

----- Original Message -----
From: Gurjeet Singh <gurjeet@gmail.com>
To: user@hbase.apache.org; lars hofhansl <lhofhansl@yahoo.com>
Sent: Sunday, August 12, 2012 3:51 PM
Subject: Re: Slow full-table scans

Hi Lars,

Yes, I need to retrieve all the values for a row at a time. That said,
I did experiment with different batch sizes and that made no
difference whatsoever. (Caching, on the other hand, did make some
difference: ~2-3% faster with a larger cache.)

I see your point about scanners returning sorted KVs. In my
application, I simply don't care whether the results are sorted or not
and I know the key range in advance. This is a great suggestion. Let
me try replacing a single scan with a list of GETs or a bunch of SCANs
with different start/stop rows.
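
A rough sketch of the "bunch of SCANs with different start/stop rows" idea, using only the Java standard library. A TreeMap stands in for the table and scanRange is a hypothetical placeholder for one range scan; in a real client each task would presumably open its own scanner over its start/stop rows. The split points, key type, and thread count are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Splits a known key range into sub-ranges and scans them concurrently. */
final class ParallelRangeScan {

    /** Hypothetical stand-in for one range scan: values with start <= key < stop. */
    static List<String> scanRange(NavigableMap<Long, String> table, long start, long stop) {
        return new ArrayList<>(table.subMap(start, true, stop, false).values());
    }

    /** Runs one scan per [splits[i], splits[i+1]) range on a fixed-size thread pool. */
    static List<String> scanInParallel(NavigableMap<Long, String> table, long[] splits, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<List<String>>> pending = new ArrayList<>();
            for (int i = 0; i + 1 < splits.length; i++) {
                final long start = splits[i];
                final long stop = splits[i + 1];
                pending.add(pool.submit(() -> scanRange(table, start, stop)));
            }
            List<String> all = new ArrayList<>();
            for (Future<List<String>> f : pending) {
                all.addAll(f.get()); // blocks until that sub-range has finished
            }
            return all;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while scanning", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("sub-range scan failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        NavigableMap<Long, String> table = new TreeMap<>();
        for (long k = 0; k < 100; k++) {
            table.put(k, "row-" + k);
        }
        long[] splits = {0, 25, 50, 75, 100};
        System.out.println(scanInParallel(table, splits, 4).size() + " rows read");
    }
}
```

Since the application does not need sorted results, each sub-range could also be processed as soon as its future completes instead of waiting for the ranges in order.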


On Sun, Aug 12, 2012 at 3:24 PM, lars hofhansl <lhofhansl@yahoo.com> wrote:
> Do you really have to retrieve all 200,000 columns each time?
> Scan.setBatch(...) makes no difference?! (Note that batching is different and separate
> from caching.)
> Also note that the scanner contract is to return sorted KVs, so a single scan cannot
> be parallelized across RegionServers (well, not entirely true: it could be farmed off in parallel
> and then be presented to the client in the right order, but HBase is not doing that). That
> is why one vs 12 RSs makes no difference in this scenario.
> In the 12 node case you'll see low CPU on all but one RS, and each RS will get its turn.
> In your case this is scanning 20,000,000 KVs serially in 400s; that's 50,000 KVs/s, which,
> depending on hardware, is not too bad for HBase (but not great either).
> If you only ever expect to run a single query like this on top of your cluster (i.e. your
> concern is latency, not throughput), you can do multiple RPCs in parallel, each for a sub-portion
> of your key range. Together with batching, you can start using values before everything is
> streamed back from the server.
> -- Lars
> ----- Original Message -----
> From: Gurjeet Singh <gurjeet@gmail.com>
> To: user@hbase.apache.org
> Cc:
> Sent: Saturday, August 11, 2012 11:04 PM
> Subject: Slow full-table scans
> Hi,
> I am trying to read all the data out of an HBase table using a scan
> and it is extremely slow.
> Here are some characteristics of the data:
> 1. The total table size is tiny (~200MB)
> 2. The table has ~100 rows and ~200,000 columns in a SINGLE family.
> Thus the size of each cell is ~10bytes and the size of each row is
> ~2MB
> 3. Currently scanning the whole table takes ~400s (both in a
> distributed setting with 12 nodes or so and on a single node), thus
> ~4sec/row
> 4. The row keys are unique 8 byte crypto hashes of sequential numbers
> 5. The scanner is set to fetch a FULL row at a time (scan.setBatch)
> and is set to fetch 100MB of data at a time (scan.setCaching)
> 6. Changing the caching size seems to have no effect on the total scan
> time at all
> 7. The column family is set up to keep a single version of the cells,
> no compression, and no block cache.
> Am I missing something? Is there a way to optimize this?
> I guess a general question I have is whether HBase is a good datastore
> for storing many medium-sized (~50GB), dense datasets with lots of
> columns when a lot of the queries require full table scans?
> Thanks!
> Gurjeet
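
The figures quoted in this thread can be checked with a few lines of arithmetic. The sketch below only restates the approximate numbers from the messages (about 100 rows, about 200,000 columns per row, about 400 s per scan); the class and method names are made up for illustration.

```java
/** Back-of-the-envelope check of the scan figures quoted in the thread. */
final class ScanThroughput {

    /** Total number of cells (KVs) in the table. */
    static long totalCells(long rows, long columnsPerRow) {
        return rows * columnsPerRow;
    }

    /** Average cells read per second over the whole scan. */
    static long cellsPerSecond(long cells, long seconds) {
        return cells / seconds;
    }

    public static void main(String[] args) {
        long cells = totalCells(100, 200_000);
        System.out.println("cells: " + cells);
        System.out.println("cells/s over 400 s: " + cellsPerSecond(cells, 400));
    }
}
```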
