hbase-user mailing list archives

From Gurjeet Singh <gurj...@gmail.com>
Subject Re: Slow full-table scans
Date Mon, 13 Aug 2012 04:41:40 GMT
It seems like the client code just sits idle, waiting for data from
the regionservers.

Gurjeet

On Sun, Aug 12, 2012 at 4:13 PM, Jacques <whshub@gmail.com> wrote:
> I think the first question is where is the time spent.  Does your analysis
> show that all the time spent is on the regionservers or is a portion of the
> bottleneck on the client side?
>
> Jacques
>
>
>
> On Sun, Aug 12, 2012 at 4:00 PM, Mohammad Tariq <dontariq@gmail.com> wrote:
>
>> The getStartKey and getEndKey methods provided by the HRegionInfo class
>> can be used for that purpose.
>> Also, please make sure that no HTable instance is left open once you
>> are done with reads.
>> Regards,
>>     Mohammad Tariq
>>
>>
>>
>> On Mon, Aug 13, 2012 at 4:22 AM, Gurjeet Singh <gurjeet@gmail.com> wrote:
>>
>> > Hi Mohammad,
>> >
>> > This is a great idea. Is there an API call to determine the start/end
>> > key for each region ?
>> >
>> > Thanks,
>> > Gurjeet
>> >
>> > On Sun, Aug 12, 2012 at 3:49 PM, Mohammad Tariq <dontariq@gmail.com> wrote:
>> > > Hello experts,
>> > >
>> > >        Would it be feasible to create a separate thread for each
>> > > region? I mean we can determine the start and end key of each region
>> > > and issue a scan for each region in parallel.
>> > >
>> > > Regards,
>> > >     Mohammad Tariq
>> > >
>> > >
>> > >
>> > > On Mon, Aug 13, 2012 at 3:54 AM, lars hofhansl <lhofhansl@yahoo.com> wrote:
>> > >
>> > >> Do you really have to retrieve all 200.000 each time?
>> > >> Scan.setBatch(...) makes no difference?! (note that batching is
>> > >> different and separate from caching).
>> > >>
>> > >> Also note that the scanner contract is to return sorted KVs, so a
>> > >> single scan cannot be parallelized across RegionServers (well not
>> > >> entirely true, it could be farmed off in parallel and then be
>> > >> presented to the client in the right order - but HBase is not doing
>> > >> that). That is why one vs 12 RSs makes no difference in this
>> > >> scenario.
>> > >>
>> > >> In the 12 node case you'll see low CPU on all but one RS, and each
>> > >> RS will get its turn.
>> > >>
>> > >> In your case this is scanning 20.000.000 KVs serially in 400s,
>> > >> that's 50000 KVs/s, which - depending on hardware - is not too bad
>> > >> for HBase (but not great either).
>> > >>
>> > >> If you only ever expect to run a single query like this on top of
>> > >> your cluster (i.e. your concern is latency not throughput) you can do
>> > >> multiple RPCs in parallel for a sub portion of your key range.
>> > >> Together with batching, you can start using values before everything
>> > >> is streamed back from the server.
>> > >>
>> > >>
>> > >> -- Lars
>> > >>
>> > >>
>> > >>
>> > >> ----- Original Message -----
>> > >> From: Gurjeet Singh <gurjeet@gmail.com>
>> > >> To: user@hbase.apache.org
>> > >> Cc:
>> > >> Sent: Saturday, August 11, 2012 11:04 PM
>> > >> Subject: Slow full-table scans
>> > >>
>> > >> Hi,
>> > >>
>> > >> I am trying to read all the data out of an HBase table using a scan
>> > >> and it is extremely slow.
>> > >>
>> > >> Here are some characteristics of the data:
>> > >>
>> > >> 1. The total table size is tiny (~200MB)
>> > >> 2. The table has ~100 rows and ~200,000 columns in a SINGLE family.
>> > >> Thus the size of each cell is ~10bytes and the size of each row is
>> > >> ~2MB
>> > >> 3. Currently scanning the whole table takes ~400s (both in a
>> > >> distributed setting with 12 nodes or so and on a single node), thus
>> > >> 5sec/row
>> > >> 4. The row keys are unique 8 byte crypto hashes of sequential numbers
>> > >> 5. The scanner is set to fetch a FULL row at a time (scan.setBatch)
>> > >> and is set to fetch 100MB of data at a time (scan.setCaching)
>> > >> 6. Changing the caching size seems to have no effect on the total scan
>> > >> time at all
>> > >> 7. The column family is set up to keep a single version of the cells,
>> > >> no compression, and no block cache.
>> > >>
>> > >> Am I missing something ? Is there a way to optimize this ?
>> > >>
>> > >> I guess a general question I have is whether HBase is a good
>> > >> datastore for storing many medium sized (~50GB), dense datasets with
>> > >> lots of columns when a lot of the queries require full table scans ?
>> > >>
>> > >> Thanks!
>> > >> Gurjeet
>> > >>
>> > >>
>> >
>>
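
For anyone following the thread, the per-region idea discussed above (one scan per key range, issued in parallel, results stitched back together in key order) might look roughly like the sketch below. It uses only the JDK so it can run on its own: RangeScanner, scanAll and the in-memory TreeMap "table" are hypothetical stand-ins, not HBase APIs. In a real client the scanner body would presumably open a Scan bounded by a region's start and end key (the values Mohammad mentions from HRegionInfo.getStartKey and getEndKey), and that wiring is an assumption left out here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelRangeScan {

    /** Hypothetical stand-in for a scan over [startKey, stopKey); not an HBase API. */
    public interface RangeScanner {
        List<String> scan(String startKey, String stopKey) throws Exception;
    }

    /**
     * Runs one scan per key range in parallel and returns the rows in range order.
     * boundaries = [k0, k1, ..., kn] describes the ranges [k0,k1), [k1,k2), ...
     */
    public static List<String> scanAll(List<String> boundaries,
                                       final RangeScanner scanner,
                                       int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<List<String>>> futures = new ArrayList<>();
            for (int i = 0; i + 1 < boundaries.size(); i++) {
                final String start = boundaries.get(i);
                final String stop = boundaries.get(i + 1);
                Callable<List<String>> task = () -> scanner.scan(start, stop);
                futures.add(pool.submit(task));
            }
            // Collect in submission order, so the merged result stays sorted
            // as long as each range's rows are sorted and ranges do not overlap.
            List<String> rows = new ArrayList<>();
            for (Future<List<String>> f : futures) {
                rows.addAll(f.get());
            }
            return rows;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // In-memory sorted map standing in for a table; only read concurrently.
        final NavigableMap<String, String> table = new TreeMap<>();
        for (String k : new String[] {"a1", "b2", "c3", "d4", "e5"}) {
            table.put(k, "v-" + k);
        }
        RangeScanner scanner = (start, stop) ->
            new ArrayList<>(table.subMap(start, true, stop, false).keySet());

        // Three made-up "regions": [a,c), [c,e), [e,z)
        List<String> boundaries = Arrays.asList("a", "c", "e", "z");
        System.out.println(scanAll(boundaries, scanner, 3));
    }
}
```

Collecting the futures in submission order is what keeps the sorted-KV contract Lars describes while the ranges are fetched concurrently. The trade-off is that all results are buffered before being returned, so for ~2MB rows a streaming merge would likely be preferable.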
