Subject: RE: Speeding up Scans
Date: Wed, 25 Jan 2012 11:17:37 -0800
From: "Geoff Hendrey"

Sorry for jumping in late, and perhaps out of context, but I'm pasting in
some findings (reported to this list by us a while back) that helped us
get scans to perform very fast. Adjusting hbase.client.prefetch.limit was
critical for us:

========================

It's even more mysterious than we think. There is a lack of documentation
(or perhaps a lack of know-how). Apparently two factors decide the
performance of a scan:

1. Scanner caching, as we know. We always had scanner caching set to 1,
   but this is different from the prefetch limit.
2. hbase.client.prefetch.limit. This is the meta caching limit. It
   defaults to 10, so 10 region locations are prefetched every time a
   scan reaches a region that has not already been pre-warmed.

The "hbase.client.prefetch.limit" value is passed along to the client code
to prefetch the next 10 region locations:

    int rows = Math.min(rowLimit,
        configuration.getInt("hbase.meta.scanner.caching", 100));

The "rows" variable is capped at 10, so at most 10 region boundaries are
prefetched at a time. Hence every new region boundary that has not already
been pre-warmed triggers a fetch of the next 10 region locations, resulting
in one slow query followed by quick responses. This is basically
pre-warming the meta cache, not the region cache.

-----Original Message-----
From: Jeff Whiting [mailto:jeffw@qualtrics.com]
Sent: Wednesday, January 25, 2012 10:09 AM
To: user@hbase.apache.org
Subject: Re: Speeding up Scans

Does it make sense to have better defaults so the performance out of the
box is better?

~Jeff

On 1/25/2012 8:06 AM, Peter Wolf wrote:
> Ah ha! I appear to be insane ;-)
>
> Adding the following speeded things up quite a bit
>
>     scan.setCacheBlocks(true);
>     scan.setCaching(1000);
>
> Thank you, it was a duh!
>
> P
>
> On 1/25/12 8:13 AM, Doug Meil wrote:
>> Hi there-
>>
>> Quick sanity check: what caching level are you using? (default is 1) I
>> know this is basic, but it's always good to double-check.
>>
>> If "language" is already in the lead position of the rowkey, why use the
>> filter?
>>
>> As for EC2, that's a wildcard.
>>
>> On 1/25/12 7:56 AM, "Peter Wolf" wrote:
>>
>>> Hello all,
>>>
>>> I am looking for advice on speeding up my Scanning.
>>>
>>> I want to iterate over all rows where a particular column (language)
>>> equals a particular value ("JA").
>>>
>>> I am already creating my row keys using that column in the first bytes.
>>> And I do my scans using partial row matching, like this...
>>>
>>>     public static byte[] calculateStartRowKey(String language) {
>>>         int languageHash = language.length() > 0 ? language.hashCode() : 0;
>>>         byte[] language2 = Bytes.toBytes(languageHash);
>>>         byte[] accountID2 = Bytes.toBytes(0);
>>>         byte[] timestamp2 = Bytes.toBytes(0);
>>>         return Bytes.add(Bytes.add(language2, accountID2), timestamp2);
>>>     }
>>>
>>>     public static byte[] calculateEndRowKey(String language) {
>>>         int languageHash = language.length() > 0 ? language.hashCode() : 0;
>>>         byte[] language2 = Bytes.toBytes(languageHash + 1);
>>>         byte[] accountID2 = Bytes.toBytes(0);
>>>         byte[] timestamp2 = Bytes.toBytes(0);
>>>         return Bytes.add(Bytes.add(language2, accountID2), timestamp2);
>>>     }
>>>
>>>     Scan scan = new Scan(calculateStartRowKey(language),
>>>         calculateEndRowKey(language));
>>>
>>> Since I am using a hash value for the string, I need to re-check the
>>> column to make sure that some other string does not get the same hash
>>> value
>>>
>>>     Filter filter = new SingleColumnValueFilter(resultFamily,
>>>         languageCol, CompareFilter.CompareOp.EQUAL, Bytes.toBytes(language));
>>>     scan.setFilter(filter);
>>>
>>> I am using the Cloudera 0.09.4 release, and a cluster of 3 machines on
>>> EC2.
>>>
>>> I think that this should be really fast, but it is not. Any advice on
>>> how to debug/speed it up?
>>>
>>> Thanks
>>> Peter
>>>
>>
>

-- 
Jeff Whiting
Qualtrics Senior Software Engineer
jeffw@qualtrics.com
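
To put rough numbers on the two settings discussed in this thread (scanner
caching and hbase.client.prefetch.limit), here is a minimal sketch in plain
Java. It is arithmetic only and uses no HBase classes. The class name, the
method names, and the 1,000,000-row table size are hypothetical, and it
assumes each scanner "next" round trip returns up to "caching" rows while
ignoring scanner open/close calls and region boundaries.

```java
// Back-of-the-envelope arithmetic only; no HBase classes are used.
// Class and method names are hypothetical, not part of any HBase API.
final class ScanCostSketch {

    // Approximate number of "next" round trips needed to stream totalRows
    // when each round trip returns up to `caching` rows.
    // Assumption: scanner open/close calls and region boundaries are ignored.
    static long nextCalls(long totalRows, int caching) {
        if (totalRows < 0 || caching < 1) {
            throw new IllegalArgumentException("need totalRows >= 0 and caching >= 1");
        }
        return (totalRows + caching - 1) / caching; // ceiling division
    }

    // Mirrors the expression quoted in the thread:
    // Math.min(rowLimit, configuration.getInt("hbase.meta.scanner.caching", 100))
    static int regionLocationsPerPrefetch(int prefetchLimit, int metaScannerCaching) {
        return Math.min(prefetchLimit, metaScannerCaching);
    }

    public static void main(String[] args) {
        long rows = 1_000_000L; // hypothetical table size
        System.out.println("caching=1    -> " + nextCalls(rows, 1) + " round trips");
        System.out.println("caching=1000 -> " + nextCalls(rows, 1000) + " round trips");
        System.out.println("prefetch     -> " + regionLocationsPerPrefetch(10, 100)
                + " region locations per lookup");
    }
}
```

Under these assumptions, raising caching from 1 to 1000 cuts the estimated
round trips for the hypothetical table from 1,000,000 to 1,000, which is
consistent with the speedup Peter reported. The second method shows why the
prefetch count stays at 10 with the defaults quoted above: the smaller of
the two values wins.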