From: Mohammad Tariq
Date: Mon, 13 Aug 2012 04:30:43 +0530
Subject: Re: Slow full-table scans
To: user@hbase.apache.org

The getStartKey and getEndKey methods provided by the HRegionInfo class
can be used for that purpose. Also, please make sure no HTable instance
is left open once you are done with your reads.

Regards,
Mohammad Tariq

On Mon, Aug 13, 2012 at 4:22 AM, Gurjeet Singh wrote:
> Hi Mohammad,
>
> This is a great idea. Is there an API call to determine the start/end
> key for each region?
>
> Thanks,
> Gurjeet
>
> On Sun, Aug 12, 2012 at 3:49 PM, Mohammad Tariq wrote:
> > Hello experts,
> >
> > Would it be feasible to create a separate thread for each region? I
> > mean we can determine the start and end key of each region and issue
> > a scan for each region in parallel.
> >
> > Regards,
> > Mohammad Tariq
> >
> > On Mon, Aug 13, 2012 at 3:54 AM, lars hofhansl wrote:
> >
> >> Do you really have to retrieve all 200,000 each time?
> >> Scan.setBatch(...) makes no difference?! (Note that batching is
> >> different and separate from caching.)
> >>
> >> Also note that the scanner contract is to return sorted KVs, so a
> >> single scan cannot be parallelized across RegionServers (well, not
> >> entirely true - it could be farmed off in parallel and then be
> >> presented to the client in the right order, but HBase is not doing
> >> that). That is why one vs. 12 RSs makes no difference in this
> >> scenario.
> >>
> >> In the 12-node case you'll see low CPU on all but one RS, and each
> >> RS will get its turn.
> >> In your case this is scanning 20,000,000 KVs serially in 400s;
> >> that's 50,000 KVs/s, which - depending on hardware - is not too bad
> >> for HBase (but not great either).
> >>
> >> If you only ever expect to run a single query like this on top of
> >> your cluster (i.e. your concern is latency, not throughput) you can
> >> do multiple RPCs in parallel, each for a sub-portion of your key
> >> range. Together with batching, you can start using values before
> >> everything has been streamed back from the server.
> >>
> >> -- Lars
> >>
> >> ----- Original Message -----
> >> From: Gurjeet Singh
> >> To: user@hbase.apache.org
> >> Sent: Saturday, August 11, 2012 11:04 PM
> >> Subject: Slow full-table scans
> >>
> >> Hi,
> >>
> >> I am trying to read all the data out of an HBase table using a scan,
> >> and it is extremely slow.
> >>
> >> Here are some characteristics of the data:
> >>
> >> 1. The total table size is tiny (~200MB).
> >> 2. The table has ~100 rows and ~200,000 columns in a SINGLE family.
> >> Thus the size of each cell is ~10 bytes and the size of each row is
> >> ~2MB.
> >> 3. Currently, scanning the whole table takes ~400s (both in a
> >> distributed setting with 12 nodes or so and on a single node), thus
> >> 5 sec/row.
> >> 4. The row keys are unique 8-byte crypto hashes of sequential
> >> numbers.
> >> 5. The scanner is set to fetch a FULL row at a time (scan.setBatch)
> >> and is set to fetch 100MB of data at a time (scan.setCaching).
> >> 6. Changing the caching size seems to have no effect on the total
> >> scan time at all.
> >> 7. The column family is set up to keep a single version of the
> >> cells, no compression, and no block cache.
> >>
> >> Am I missing something? Is there a way to optimize this?
> >>
> >> I guess a general question I have is whether HBase is a good
> >> datastore for many medium-sized (~50GB), dense datasets with lots
> >> of columns, when a lot of the queries require full table scans.
> >>
> >> Thanks!
> >> Gurjeet
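The per-region parallel scan discussed above can be sketched without a live cluster. The snippet below is a minimal simulation: the table is a sorted in-memory map, `scanRange` is a hypothetical stand-in for one region-bounded Scan, and the boundary values are made up for illustration - a real 0.94-era client would obtain the boundaries from `HTable.getStartEndKeys()` (or from each region's `HRegionInfo.getStartKey()`/`getEndKey()`) and issue one `Scan` per range.

```java
import java.util.*;
import java.util.concurrent.*;

public class ParallelRegionScan {

    // Simulated table: rows kept in sorted order, as a region server
    // would return them. A real client would scan HBase instead.
    static final NavigableMap<String, String> TABLE = new TreeMap<>();
    static {
        for (int i = 0; i < 100; i++) {
            TABLE.put(String.format("row%03d", i), "value" + i);
        }
    }

    // Stand-in for one region-bounded scan: all rows in [startKey, endKey).
    static SortedMap<String, String> scanRange(String startKey, String endKey) {
        return TABLE.subMap(startKey, endKey);
    }

    // One scan per region range, run in parallel, then stitched back
    // together in key order on the client side.
    static SortedMap<String, String> parallelScan(List<String> boundaries)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<Future<SortedMap<String, String>>> parts = new ArrayList<>();
            for (int i = 0; i + 1 < boundaries.size(); i++) {
                final String start = boundaries.get(i);
                final String end = boundaries.get(i + 1);
                parts.add(pool.submit(() -> scanRange(start, end)));
            }
            SortedMap<String, String> merged = new TreeMap<>();
            for (Future<SortedMap<String, String>> part : parts) {
                merged.putAll(part.get());
            }
            return merged;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical region boundaries; a real client would read these
        // from HTable.getStartEndKeys().
        List<String> boundaries =
                Arrays.asList("row000", "row025", "row050", "row075", "row999");
        SortedMap<String, String> all = parallelScan(boundaries);
        System.out.println(all.size() + " rows, first=" + all.firstKey());
    }
}
```

The client-side merge into a single sorted map is the part HBase itself does not do, per Lars's note about the sorted-KV scanner contract; if results can be consumed per-region out of order, the merge step can be dropped entirely.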