hbase-user mailing list archives

From Gurjeet Singh <gurj...@gmail.com>
Subject Re: Slow full-table scans
Date Sun, 12 Aug 2012 22:52:37 GMT
Hi Mohammad,

This is a great idea. Is there an API call to determine the start/end
key for each region?
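
For reference, the per-region idea can be sketched without a cluster. In the
HBase client of this era, HTable.getStartEndKeys() is, as far as I know, the
call that returns the region boundaries. The sketch below does not use the
HBase API at all: a sorted in-memory TreeMap stands in for the table, and the
names ParallelRangeScan, scanRange and parallelScan are hypothetical. It only
illustrates the pattern of one scan per key range, run in parallel, with the
results stitched back together in key order. It assumes the start keys are
sorted and that the first one is the empty string, as for the first region.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelRangeScan {

    // Stand-in for a single region scan: rows in [start, end) in key order.
    // A null end means "to the end of the table" (like an empty stop key).
    static List<Map.Entry<String, String>> scanRange(
            NavigableMap<String, String> table, String start, String end) {
        NavigableMap<String, String> slice = (end == null)
                ? table.tailMap(start, true)
                : table.subMap(start, true, end, false);
        return new ArrayList<>(slice.entrySet());
    }

    // Issues one scan per key range in parallel, then concatenates the
    // partial results in range order so the caller still sees sorted rows.
    // Assumption: startKeys is sorted and the table is not modified meanwhile.
    static List<Map.Entry<String, String>> parallelScan(
            NavigableMap<String, String> table, List<String> startKeys) {
        ExecutorService pool =
                Executors.newFixedThreadPool(Math.max(1, startKeys.size()));
        try {
            List<Future<List<Map.Entry<String, String>>>> futures = new ArrayList<>();
            for (int i = 0; i < startKeys.size(); i++) {
                final String start = startKeys.get(i);
                // The next range's start key is this range's exclusive end key.
                final String end =
                        (i + 1 < startKeys.size()) ? startKeys.get(i + 1) : null;
                Callable<List<Map.Entry<String, String>>> task =
                        () -> scanRange(table, start, end);
                futures.add(pool.submit(task));
            }
            List<Map.Entry<String, String>> merged = new ArrayList<>();
            for (Future<List<Map.Entry<String, String>>> f : futures) {
                merged.addAll(f.get()); // waits for each range in key order
            }
            return merged;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("range scan failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        NavigableMap<String, String> table = new TreeMap<>();
        for (String k : new String[] {"apple", "banana", "cherry", "grape", "melon", "peach"}) {
            table.put(k, "value-" + k);
        }
        // Hypothetical region start keys; "" marks the first region.
        List<String> startKeys = Arrays.asList("", "c", "m");
        for (Map.Entry<String, String> row : parallelScan(table, startKeys)) {
            System.out.println(row.getKey() + " -> " + row.getValue());
        }
    }
}
```

With a real table, each task would open its own scanner bounded by one
region's start and end key instead of slicing a map. Whether that helps would
still depend on the regions living on different RegionServers.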


On Sun, Aug 12, 2012 at 3:49 PM, Mohammad Tariq <dontariq@gmail.com> wrote:
> Hello experts,
>        Would it be feasible to create a separate thread for each region? I
> mean we can determine the start and end key of each region and issue a scan
> for each region in parallel.
> Regards,
>     Mohammad Tariq
> On Mon, Aug 13, 2012 at 3:54 AM, lars hofhansl <lhofhansl@yahoo.com> wrote:
>> Do you really have to retrieve all 200,000 columns each time?
>> Scan.setBatch(...) makes no difference?! (note that batching is different
>> and separate from caching).
>> Also note that the scanner contract is to return sorted KVs, so a single
>> scan cannot be parallelized across RegionServers (well not entirely true,
>> it could be farmed off in parallel and then be presented to the client in
>> the right order - but HBase is not doing that). That is why one vs 12 RSs
>> makes no difference in this scenario.
>> In the 12 node case you'll see low CPU on all but one RS, and each RS will
>> get its turn.
>> In your case this is scanning 20,000,000 KVs serially in 400s, that's
>> 50,000 KVs/s, which - depending on hardware - is not too bad for HBase (but
>> not great either).
>> If you only ever expect to run a single query like this on top of your
>> cluster (i.e. your concern is latency, not throughput), you can do multiple
>> RPCs in parallel, each for a sub-portion of your key range. Together with
>> batching, you can start using values before everything is streamed back from
>> the server.
>> -- Lars
>> ----- Original Message -----
>> From: Gurjeet Singh <gurjeet@gmail.com>
>> To: user@hbase.apache.org
>> Cc:
>> Sent: Saturday, August 11, 2012 11:04 PM
>> Subject: Slow full-table scans
>> Hi,
>> I am trying to read all the data out of an HBase table using a scan
>> and it is extremely slow.
>> Here are some characteristics of the data:
>> 1. The total table size is tiny (~200MB)
>> 2. The table has ~100 rows and ~200,000 columns in a SINGLE family.
>> Thus the size of each cell is ~10 bytes and the size of each row is
>> ~2MB
>> 3. Currently scanning the whole table takes ~400s (both in a
>> distributed setting with 12 nodes or so and on a single node), thus
>> 5sec/row
>> 4. The row keys are unique 8 byte crypto hashes of sequential numbers
>> 5. The scanner is set to fetch a FULL row at a time (scan.setBatch)
>> and is set to fetch 100MB of data at a time (scan.setCaching)
>> 6. Changing the caching size seems to have no effect on the total scan
>> time at all
>> 7. The column family is set up to keep a single version of the cells,
>> no compression, and no block cache.
>> Am I missing something? Is there a way to optimize this?
>> I guess a general question I have is whether HBase is a good datastore
>> for storing many medium-sized (~50GB), dense datasets with lots of
>> columns when a lot of the queries require full table scans?
>> Thanks!
>> Gurjeet
