Subject: Re: PrefixFilter performance question.
From: Edward Capriolo
To: hbase-user@hadoop.apache.org
Date: Thu, 10 Dec 2009 16:06:44 -0500

On Tue, Dec 8, 2009 at 11:43 PM, stack wrote:
> Try using this filter instead:
>
>      scan.setFilter(FirstKeyOnlyFilter.new())
>
> Will only return row keys, if that's the effect you are looking for.
>
> St.Ack
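Thanks. For anyone else following along, I believe the Java-client
version of that suggestion would look roughly like the sketch below.
It is untested on my side (as I mention at the bottom, my build does
not have FirstKeyOnlyFilter yet) and assumes the 0.20 client API plus
the same open HTable handle named "table" as in my snippet further
down:

  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.client.ResultScanner;
  import org.apache.hadoop.hbase.client.Scan;
  import org.apache.hadoop.hbase.filter.FirstKeyOnlyFilter;

  // Only the first KeyValue of each row is returned, so iterating the
  // scanner counts row keys without pulling whole rows over the wire.
  Scan s = new Scan();
  s.setFilter(new FirstKeyOnlyFilter());
  ResultScanner scanner = table.getScanner(s);
  long rowCount = 0;
  try {
    for (Result r : scanner) {
      rowCount++;
    }
  } finally {
    scanner.close();
  }
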
> On Tue, Dec 8, 2009 at 3:30 PM, Edward Capriolo wrote:
>
>> On Tue, Dec 8, 2009 at 6:00 PM, Andrew Purtell wrote:
>> > I added an entry to the troubleshooting page up on the wiki:
>> >
>> >    http://wiki.apache.org/hadoop/Hbase/Troubleshooting#A16
>> >
>> >  - Andy
>> >
>> > ________________________________
>> > From: Ryan Rawson
>> > To: hbase-user@hadoop.apache.org
>> > Sent: Tue, December 8, 2009 5:21:25 PM
>> > Subject: Re: PrefixFilter performance question.
>> >
>> > You want:
>> >
>> > http://hadoop.apache.org/hbase/docs/r0.20.2/api/org/apache/hadoop/hbase/client/HTable.html#scannerCaching
>> >
>> > The default is low because if a job takes too long processing, a
>> > scanner can time out, which causes unhappy jobs/people/emails.
>> >
>> > BTW, I can read small rows out of a 19-node cluster at 7 million
>> > rows/sec using a map-reduce program. Any individual process is doing
>> > 40k+ rows/sec or so.
>> >
>> > -ryan
>> >
>> > On Tue, Dec 8, 2009 at 12:25 PM, Edward Capriolo wrote:
>> >> Hey all,
>> >>
>> >> I have been doing some performance evaluation with mysql vs hbase.
>> >>
>> >> I have a table webtable:
>> >>
>> >> {NAME => 'webdata', FAMILIES => [
>> >>   {NAME => 'anchor',   COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647',
>> >>    BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
>> >>   {NAME => 'image',    COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647',
>> >>    BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
>> >>   {NAME => 'raw_data', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647',
>> >>    BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}
>> >>
>> >> I have a normalized version in mysql. I currently have loaded:
>> >>
>> >> nyhadoopdev6:60030   1260289750689   requests=4, regions=3, usedHeap=99,  maxHeap=997
>> >> nyhadoopdev7:60030   1260289862481   requests=0, regions=2, usedHeap=181, maxHeap=997
>> >> nyhadoopdev8:60030   1260289909059   requests=0, regions=2, usedHeap=395, maxHeap=997
>> >>
>> >> This is a snippet here:
>> >>
>> >>   if (mysql) {
>> >>     try {
>> >>       PreparedStatement ps = conn.prepareStatement(
>> >>           "SELECT * FROM page WHERE page LIKE (?)");
>> >>       ps.setString(1, "http://www.s%");
>> >>       ResultSet rs = ps.executeQuery();
>> >>       while (rs.next()) {
>> >>         sPageCount++;
>> >>       }
>> >>       rs.close();
>> >>       ps.close();
>> >>     } catch (SQLException ex) {
>> >>       System.out.println(ex);
>> >>       System.exit(1);
>> >>     }
>> >>   }
>> >>
>> >>   if (hbase) {
>> >>     Scan s = new Scan();
>> >>     //s.setCacheBlocks(true);
>> >>     s.setFilter(new PrefixFilter(Bytes.toBytes("http://www.s")));
>> >>     ResultScanner scanner = table.getScanner(s);
>> >>     try {
>> >>       for (Result rr : scanner) {
>> >>         sPageCount++;
>> >>       }
>> >>     } finally {
>> >>       scanner.close();
>> >>     }
>> >>   }
>> >>
>> >> I am seeing about 0.3 ms from mysql and about 20 seconds from HBase.
>> >> I have read some tuning docs, but most seem geared for insertion
>> >> speed, not search speed. I would think this would be a
>> >> bread-and-butter search for HBase, since the row keys are naturally
>> >> sorted lexicographically. I am not running a giant setup here, 3
>> >> nodes, 2x replication, but I would think that is almost a non-factor
>> >> here since the data is fairly small. Hints?
>> >
>> >
>>
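(For context, the "this" in my lines quoted just below is the scanner
caching Ryan linked to. I was bumping it roughly as in the sketch here,
assuming the 0.20 client API and the same classes/imports as my snippet
quoted above; Scan.setCaching() per scan, though HTable.setScannerCaching()
should give the same effect table-wide. The value 30 is just one of the
ones I tried.)

  // Fetch rows from the region server in batches of 30 per trip
  // instead of one at a time. Bigger batches mean fewer round trips
  // but more rows buffered in memory, which is presumably where the
  // OOM at 1000 came from.
  Scan s = new Scan();
  s.setFilter(new PrefixFilter(Bytes.toBytes("http://www.s")));
  s.setCaching(30);
  ResultScanner scanner = table.getScanner(s);
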
>> I raised this from 1 to 30 -> 18 sec
>> I raised this to 100 -> 17 sec
>> I raised this to 1000 -> OOM
>>
>> The OOM pointed me in the direction that this comparison is not apples
>> to apples. In mysql the page table is normalized, but in HBase it is
>> not. I see lots of data moving across the wire.
>>
>> I tried to filter to just move the ROW key across the wire, but I do
>> not think I have it right...
>>
>>   List<Filter> filters = new ArrayList<Filter>();
>>   filters.add(new PrefixFilter(Bytes.toBytes("http://www.s")));
>>   filters.add(new QualifierFilter(CompareOp.EQUAL,
>>       new BinaryComparator(Bytes.toBytes("ROW"))));
>>   Filter f = new FilterList(Operator.MUST_PASS_ALL, filters);
>>   s.setFilter(f);
>>   ResultScanner scanner = table.getScanner(s);
>>
>

I have added the smallest family I have:

  s.addFamily(Bytes.toBytes("anchor"));

This drops the search to spage_time: 2266 ms, and a second consecutive
search takes ~1000 ms. That is more reasonable; the remaining
discrepancy could be explained by each entry having 5-10 random anchors
associated with it.

I have used the CE HBase 0.20.0 RPM, and guess what I do not have?
FirstKeyOnlyFilter :)

I really like the HBase layout and init scripts this RPM provides, but
I can't seem to find the src.rpm for it anywhere. If I do not find it
in a few days, I might just move to the latest release or trunk.
(Side note: does anyone have the source RPM?)
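Once I get a build that actually has FirstKeyOnlyFilter, my plan is to
replace the QualifierFilter attempt quoted above with something along
these lines, so that only the first KeyValue of each matching row
crosses the wire. Untested for now, and again assuming the 0.20 client
API and the same "table" handle as before:

  import java.util.ArrayList;
  import java.util.List;
  import org.apache.hadoop.hbase.client.ResultScanner;
  import org.apache.hadoop.hbase.client.Scan;
  import org.apache.hadoop.hbase.filter.Filter;
  import org.apache.hadoop.hbase.filter.FilterList;
  import org.apache.hadoop.hbase.filter.FirstKeyOnlyFilter;
  import org.apache.hadoop.hbase.filter.PrefixFilter;
  import org.apache.hadoop.hbase.util.Bytes;

  // Keep only rows whose key starts with the prefix, and for each such
  // row return just its first KeyValue rather than every family.
  List<Filter> filters = new ArrayList<Filter>();
  filters.add(new PrefixFilter(Bytes.toBytes("http://www.s")));
  filters.add(new FirstKeyOnlyFilter());

  Scan s = new Scan();
  s.setFilter(new FilterList(FilterList.Operator.MUST_PASS_ALL, filters));
  ResultScanner scanner = table.getScanner(s);

If that works as I expect, the HBase side becomes closer to a pure
row-key count, which is what I am really trying to measure.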