From: Michael Segel
Subject: RE: Something like Execution Plan as in the RDBMS world?
Date: Thu, 4 Aug 2011 17:05:14 -0500

Thomas,

If I understand you correctly, you have a row key of A,B,C and you want to fetch only the rows on A and C.

You can do a start row of A
and then do the end row of A1,
so that you get the first row for the given vehicle_id, and then stop when the vehicle_id changes.

You would then have to do a server side filter on values for C to get the timestamp for a given day.
(You could do this with a client side filter, but that means pushing all the data over the wire.)

[Note: having said that, you could just do a client side filter, since you only have 115K rows and you're going to get a subset of that returned by the range key.]

The idea of doing something like the following:

    SELECT *
    FROM TABLE
    WHERE A = x
    AND DAY(C) = y [or some variation]

{A and C are part of a composite index} doesn't work in HBase.

If your key was ACB, meaning that vehicle_id, timestamp, device_id was the composite key, then you could do a start/stop range scan using A and C.

Sorry if I'm missing something, since I jumped into the middle of a discussion.

-Mike

> Subject: RE: Something like Execution Plan as in the RDBMS world?
> Date: Thu, 4 Aug 2011 12:57:12 +0200
> From: Thomas.Steinmaurer@scch.at
> To: user@hbase.apache.org; apurtell@apache.org
>
> Hi Andy and Ted!
>
> Thanks for your reply.
> Basically, I'm currently trying a range scan and a regex row filter on a very small table (~ 115K rows), just to get used to it. Hadoop/HBase ... is running in the available Cloudera VM.
>
> I have the following row key, as already discussed in other threads.
>
> vehicle_id: up to 16 characters
> device_id: up to 16 characters
> timestamp: YYYYMMDDhhmmss
>
> Pretty much one row every 5 minutes for a particular vehicle and device.
>
> Now I want to get the rows for an entire day for a particular vehicle and device.
>
> The following range scan implementation:
>
> Scan scan = new Scan();
>
> String startKey =
>     String.format(HBASE_ROWKEY_DATASOURCEID_FORMAT, "57").replace(' ', '0') // Vehicle ID
>     + "-"
>     + String.format(HBASE_ROWKEY_DATASOURCEID_FORMAT, "1").replace(' ', '0') // Device ID
>     + "-"
>     + "20110808000000";
> String endKey =
>     String.format(HBASE_ROWKEY_DATASOURCEID_FORMAT, "57").replace(' ', '0') // Vehicle ID
>     + "-"
>     + String.format(HBASE_ROWKEY_DATASOURCEID_FORMAT, "1").replace(' ', '0') // Device ID
>     + "-"
>     + "20110808235959";
> scan.setStartRow(Bytes.toBytes(startKey));
> scan.setStopRow(Bytes.toBytes(endKey));
> scan.addColumn(Bytes.toBytes("data_details"), Bytes.toBytes("temperature1_value"));
>
> Takes < 1 sec.
>
> Whereas the following regex based row filter implementation:
>
> List<Filter> filters = new ArrayList<Filter>();
> RowFilter rf = new RowFilter(
>     CompareFilter.CompareOp.EQUAL,
>     new RegexStringComparator(".{14}57\\-.{15}1\\-20110808.{6}")
> );
> filters.add(rf);
>
> QualifierFilter qf = new QualifierFilter(
>     CompareFilter.CompareOp.EQUAL,
>     new RegexStringComparator("temperature1_value")
> );
> filters.add(qf);
>
> FilterList filterList1 = new FilterList(filters);
> scan.setFilter(filterList1);
>
> Takes around 6 sec on a very small table.
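The gap between the two timings above is what one would expect if the range scan only visits the rows between the start and stop keys, while the row filter is evaluated against every row in the table. A minimal, HBase-free sketch of that difference in plain Java follows. A TreeMap stands in for HBase's sorted row keys; the class name, the helper names, the sample data and the "%16s" padding format (standing in for HBASE_ROWKEY_DATASOURCEID_FORMAT, whose value is not shown in this thread) are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.regex.Pattern;

// Hypothetical stand-in for an HBase table: a sorted map keyed by row key.
final class RowKeyScanSketch {

    // Assumed format: right-align in 16 characters, then zero-fill.
    static String pad16(String id) {
        return String.format("%16s", id).replace(' ', '0');
    }

    // Row key layout from the thread: vehicle_id-device_id-timestamp.
    static String rowKey(String vehicleId, String deviceId, String timestamp) {
        return pad16(vehicleId) + "-" + pad16(deviceId) + "-" + timestamp;
    }

    // 3 vehicles x 2 devices x 3 days x 4 readings per day = 72 made-up rows.
    static NavigableMap<String, String> sampleTable() {
        NavigableMap<String, String> table = new TreeMap<>();
        for (String vehicle : new String[] {"56", "57", "58"}) {
            for (String device : new String[] {"1", "2"}) {
                for (String day : new String[] {"20110807", "20110808", "20110809"}) {
                    for (String time : new String[] {"000000", "060000", "120000", "180000"}) {
                        table.put(rowKey(vehicle, device, day + time), "temperature");
                    }
                }
            }
        }
        return table;
    }

    // Range scan: only keys in [startRow, stopRow) are visited.
    static List<String> rangeScan(NavigableMap<String, String> table,
                                  String startRow, String stopRow) {
        return new ArrayList<>(table.subMap(startRow, true, stopRow, false).keySet());
    }

    // Filter-only scan: every key in the table is tested against the regex.
    static List<String> regexScan(NavigableMap<String, String> table, String regex) {
        Pattern pattern = Pattern.compile(regex);
        List<String> hits = new ArrayList<>();
        for (String key : table.keySet()) {
            if (pattern.matcher(key).matches()) {
                hits.add(key);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        NavigableMap<String, String> table = sampleTable();
        List<String> viaRange = rangeScan(table,
                rowKey("57", "1", "20110808000000"),
                rowKey("57", "1", "20110808235959"));
        List<String> viaRegex = regexScan(table, ".{14}57\\-.{15}1\\-20110808.{6}");
        System.out.println("range scan visited " + viaRange.size() + " keys");
        System.out.println("regex scan visited " + table.size()
                + " keys and kept " + viaRegex.size());
    }
}
```

Both approaches return the same rows here; only the number of keys visited differs. The stop row is exclusive in this sketch, as it is for an HBase Scan, so a row stamped exactly 20110808235959 would not be returned by the range above.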
>
> We aren't sure if we need the regex row filter capabilities at all or if range scans are sufficient for our access pattern. But a better understanding of how to optimize regex stuff would be helpful.
>
> Thanks!
>
> Thomas
>
>
> -----Original Message-----
> From: Andrew Purtell [mailto:apurtell@apache.org]
> Sent: Wednesday, 27 July 2011 08:25
> To: user@hbase.apache.org
> Subject: Re: Something like Execution Plan as in the RDBMS world?
>
> > Or is this a completely different way of thinking?
>
> Yes.
>
> There isn't an "execution plan" when using HBase, as that term is commonly understood from RDBMS systems. The commands you issue against HBase using the client API are executed in order as you issue them.
>
> > Depending on the access pattern, we might be in a situation to use
> > e.g. RegEx filters on rowkeys. I wonder if there is some kind of an
> > execution plan when running a HBase query to better understand
>
> Exposing filter statistics (hit/skip ratio etc.) and other per-query metrics like number of store files read, how many keys examined, etc. is an interesting idea, perhaps along the lines of what you ask, but HBase does not have support for that level of query performance introspection at the moment.
>
> What people do is measure the application metrics of interest and try different approaches to optimize them.
>
> Best regards,
>
>
> - Andy
>
> Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)
>
>
> >________________________________
> >From: Steinmaurer Thomas
> >To: user@hbase.apache.org
> >Sent: Tuesday, July 26, 2011 11:10 PM
> >Subject: Something like Execution Plan as in the RDBMS world?
> >
> >Hello,
> >
> >We have a three-part row key, taking into account that the first part is important for distribution/partitioning when the system grows. Depending on the access pattern, we might be in a situation to use e.g. RegEx filters on row keys. I wonder if there is some kind of an execution plan (as known in RDBMS) when running an HBase query to better understand how HBase processes the query and what execution path it takes to generate the result set.
> >
> >Or is this a completely different way of thinking?
> >
> >Thanks,
> >
> >Thomas
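
The key-order point from the reply at the top of this message can be sketched in plain Java without HBase. A sorted set stands in for the table's row keys. With the A-B-C layout (vehicle_id, device_id, timestamp), a query on A and a day of C has to range over the whole vehicle prefix and filter each row. With the A-C-B layout (vehicle_id, timestamp, device_id), the same query becomes a single start/stop range. The class, the helper names, the sample data and the "%16s" padding are assumptions made for illustration, not code from the thread.

```java
import java.util.NavigableSet;
import java.util.TreeSet;

// Hypothetical comparison of two row key layouts over the same made-up data.
final class KeyOrderSketch {

    // Assumed format: right-align in 16 characters, then zero-fill.
    static String pad16(String id) {
        return String.format("%16s", id).replace(' ', '0');
    }

    // Exclusive stop key that sorts after every key starting with prefix.
    // Assumes a non-empty prefix whose last character is not '\uffff'.
    static String stopAfterPrefix(String prefix) {
        int last = prefix.length() - 1;
        return prefix.substring(0, last) + (char) (prefix.charAt(last) + 1);
    }

    // 3 vehicles x 2 devices x 3 days x 4 readings per day = 72 keys.
    static NavigableSet<String> buildKeys(boolean timestampBeforeDevice) {
        NavigableSet<String> keys = new TreeSet<>();
        for (String vehicle : new String[] {"56", "57", "58"}) {
            for (String device : new String[] {"1", "2"}) {
                for (String day : new String[] {"20110807", "20110808", "20110809"}) {
                    for (String time : new String[] {"000000", "060000", "120000", "180000"}) {
                        String ts = day + time;
                        keys.add(timestampBeforeDevice
                                ? pad16(vehicle) + "-" + ts + "-" + pad16(device)   // A-C-B
                                : pad16(vehicle) + "-" + pad16(device) + "-" + ts); // A-B-C
                    }
                }
            }
        }
        return keys;
    }

    // A-B-C layout: range over the vehicle prefix, then filter on the day.
    // Returns {keys examined, keys matched}.
    static int[] scanVehicleThenFilterDay(NavigableSet<String> abcKeys,
                                          String vehicle, String day) {
        String start = pad16(vehicle) + "-";
        int examined = 0;
        int matched = 0;
        for (String key : abcKeys.subSet(start, true, stopAfterPrefix(start), false)) {
            examined++;
            // In the A-B-C layout the timestamp is the last 14 characters.
            if (key.substring(key.length() - 14).startsWith(day)) {
                matched++;
            }
        }
        return new int[] {examined, matched};
    }

    // A-C-B layout: vehicle and day form one contiguous key range.
    // Returns {keys examined, keys matched}, which are equal here.
    static int[] scanVehicleDay(NavigableSet<String> acbKeys, String vehicle, String day) {
        String start = pad16(vehicle) + "-" + day;
        int n = acbKeys.subSet(start, true, stopAfterPrefix(start), false).size();
        return new int[] {n, n};
    }

    public static void main(String[] args) {
        int[] abc = scanVehicleThenFilterDay(buildKeys(false), "57", "20110808");
        int[] acb = scanVehicleDay(buildKeys(true), "57", "20110808");
        System.out.println("A-B-C: examined " + abc[0] + ", matched " + abc[1]);
        System.out.println("A-C-B: examined " + acb[0] + ", matched " + acb[1]);
    }
}
```

The trade-off in this sketch is that the A-C-B layout makes "one vehicle, one day, all devices" cheap, while "one vehicle, one device, one day" no longer maps to a single contiguous range. Which layout fits better depends on the access pattern being optimized.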