Subject: Re: Slow full-table scans
From: Jacques
To: user@hbase.apache.org
Date: Sun, 12 Aug 2012 16:13:01 -0700

I think the first question is where the time is spent. Does your analysis
show that all the time is spent on the regionservers, or is a portion of
the bottleneck on the client side?

Jacques

On Sun, Aug 12, 2012 at 4:00 PM, Mohammad Tariq wrote:

> The getStartKey and getEndKey methods provided by the HRegionInfo class
> can be used for that purpose.
> Also, please make sure no HTable instance is left open once you are done
> with your reads.
>
> Regards,
> Mohammad Tariq
>
>
> On Mon, Aug 13, 2012 at 4:22 AM, Gurjeet Singh wrote:
>
> > Hi Mohammad,
> >
> > This is a great idea. Is there an API call to determine the start/end
> > key for each region ?
> >
> > Thanks,
> > Gurjeet
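(A minimal sketch of the API Mohammad is referring to, against the
0.94-era client. The table name "mytable" and the configuration are
assumptions; HTable.getStartEndKeys() returns every region's boundaries
in one call, so you don't need to walk HRegionInfo entries yourself:)

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.hbase.util.Pair;

    public class RegionBoundaries {
      public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "mytable");   // hypothetical table name
        try {
          // Parallel arrays: the start key and end key of each region, in order.
          Pair<byte[][], byte[][]> keys = table.getStartEndKeys();
          for (int i = 0; i < keys.getFirst().length; i++) {
            System.out.println("region " + i
                + " start=" + Bytes.toStringBinary(keys.getFirst()[i])
                + " end="   + Bytes.toStringBinary(keys.getSecond()[i]));
          }
        } finally {
          table.close();   // per Mohammad's note: don't leave HTable instances open
        }
      }
    }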
> > On Sun, Aug 12, 2012 at 3:49 PM, Mohammad Tariq wrote:
> > > Hello experts,
> > >
> > > Would it be feasible to create a separate thread for each region? I
> > > mean, we can determine the start and end key of each region and
> > > issue a scan for each region in parallel.
> > >
> > > Regards,
> > > Mohammad Tariq
> > >
> > >
> > > On Mon, Aug 13, 2012 at 3:54 AM, lars hofhansl wrote:
> > >
> > >> Do you really have to retrieve all 200,000 each time?
> > >> Scan.setBatch(...) makes no difference?! (Note that batching is
> > >> different from, and separate from, caching.)
> > >>
> > >> Also note that the scanner contract is to return sorted KVs, so a
> > >> single scan cannot be parallelized across RegionServers (well, not
> > >> entirely true - it could be farmed off in parallel and then be
> > >> presented to the client in the right order - but HBase is not doing
> > >> that). That is why one vs. 12 RSs makes no difference in this
> > >> scenario.
> > >>
> > >> In the 12-node case you'll see low CPU on all but one RS, and each
> > >> RS will get its turn.
> > >>
> > >> In your case this is scanning 20,000,000 KVs serially in 400s;
> > >> that's 50,000 KVs/s, which - depending on hardware - is not too bad
> > >> for HBase (but not great either).
> > >>
> > >> If you only ever expect to run a single query like this on top of
> > >> your cluster (i.e. your concern is latency, not throughput), you
> > >> can do multiple RPCs in parallel, each over a sub-portion of your
> > >> key range. Together with batching you can start using values before
> > >> everything has been streamed back from the server.
> > >>
> > >>
> > >> -- Lars
> > >>
> > >>
> > >>
> > >> ----- Original Message -----
> > >> From: Gurjeet Singh
> > >> To: user@hbase.apache.org
> > >> Cc:
> > >> Sent: Saturday, August 11, 2012 11:04 PM
> > >> Subject: Slow full-table scans
> > >>
> > >> Hi,
> > >>
> > >> I am trying to read all the data out of an HBase table using a scan
> > >> and it is extremely slow.
> > >>
> > >> Here are some characteristics of the data:
> > >>
> > >> 1. The total table size is tiny (~200MB).
> > >> 2. The table has ~100 rows and ~200,000 columns in a SINGLE family.
> > >>    Thus the size of each cell is ~10 bytes and the size of each row
> > >>    is ~2MB.
> > >> 3. Currently scanning the whole table takes ~400s (both in a
> > >>    distributed setting with 12 nodes or so and on a single node),
> > >>    thus 5 sec/row.
> > >> 4. The row keys are unique 8-byte crypto hashes of sequential
> > >>    numbers.
> > >> 5. The scanner is set to fetch a FULL row at a time (scan.setBatch)
> > >>    and to fetch 100MB of data at a time (scan.setCaching).
> > >> 6. Changing the caching size seems to have no effect on the total
> > >>    scan time at all.
> > >> 7. The column family is set up to keep a single version of the
> > >>    cells, with no compression and no block cache.
> > >>
> > >> Am I missing something? Is there a way to optimize this?
> > >>
> > >> I guess a general question I have is whether HBase is a good
> > >> datastore for storing many medium-sized (~50GB), dense datasets
> > >> with lots of columns, when a lot of the queries require full table
> > >> scans?
> > >>
> > >> Thanks!
> > >> Gurjeet
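(Sketching what lars and Mohammad suggest - one scan per region, issued
in parallel - again against the 0.94-era client. The table name, thread
count, and the caching/batch values are assumptions to tune for your row
sizes; note that rows arrive in no particular global order across
threads, which is fine for a full-table pass but not for anything that
needs sorted output:)

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Pair;

    public class ParallelFullScan {
      public static void main(String[] args) throws Exception {
        final Configuration conf = HBaseConfiguration.create();
        HTable meta = new HTable(conf, "mytable");   // hypothetical table name
        Pair<byte[][], byte[][]> keys = meta.getStartEndKeys();
        meta.close();

        int regions = keys.getFirst().length;
        ExecutorService pool = Executors.newFixedThreadPool(regions);
        List<Future<Long>> counts = new ArrayList<Future<Long>>();
        for (int i = 0; i < regions; i++) {
          final byte[] start = keys.getFirst()[i];
          final byte[] stop = keys.getSecond()[i];   // empty for the last region = scan to end
          counts.add(pool.submit(new Callable<Long>() {
            public Long call() throws IOException {
              // One HTable per thread -- HTable is not thread-safe.
              HTable table = new HTable(conf, "mytable");
              try {
                Scan scan = new Scan(start, stop);
                scan.setBatch(10000);  // max columns per Result: wide rows come back in chunks
                scan.setCaching(10);   // Results per RPC (with batching, a Result is a row chunk)
                ResultScanner scanner = table.getScanner(scan);
                long cells = 0;
                try {
                  for (Result r : scanner) {
                    cells += r.size();   // consume each chunk as it arrives
                  }
                } finally {
                  scanner.close();
                }
                return cells;
              } finally {
                table.close();
              }
            }
          }));
        }
        long total = 0;
        for (Future<Long> f : counts) total += f.get();
        pool.shutdown();
        System.out.println("scanned " + total + " KVs");
      }
    }

(With ~100 rows of ~2MB each, a small setCaching combined with a large
setBatch keeps each RPC response around 1MB while still streaming every
wide row in chunks, which is the "start using values before all is
streamed back" behavior lars describes.)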