Mailing-List: contact user-help@hbase.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@hbase.apache.org
Received-SPF: pass (nike.apache.org: domain of michael_segel@hotmail.com
 designates 65.55.34.79 as permitted sender)
Message-ID: <COL117-W3952442DAABAB2367E33A78F3D0@phx.gbl>
Content-Type: multipart/alternative;
	boundary="_a5ed3440-2c95-43c9-b772-b5abf8c5fac2_"
From: Michael Segel <michael_segel@hotmail.com>
To: <user@hbase.apache.org>
Subject: RE: Something like Execution Plan as in the RDBMS world?
Date: Thu, 4 Aug 2011 17:05:14 -0500
Importance: Normal
In-Reply-To: <84B5E4309B3B9F4ABFF7664C3CD7698302D0DCEF@kairo.scch.at>
References: <84B5E4309B3B9F4ABFF7664C3CD7698302D0DCBE@kairo.scch.at>
 <1311747890.58504.YahooMailNeo@web65501.mail.ac4.yahoo.com>,<84B5E4309B3B9F4ABFF7664C3CD7698302D0DCEF@kairo.scch.at>
MIME-Version: 1.0

--_a5ed3440-2c95-43c9-b772-b5abf8c5fac2_
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable


Thomas,

If I understand you correctly, you have a row key of A,B,C and you want to fetch only the rows matching on A and C.
You can do a start row of A
and then an end row of A1.

That way you start at the first row for the given vehicle_id and stop when the vehicle_id changes.
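
A minimal sketch of the start/stop idea above, in plain Java with no HBase dependency. It assumes the key layout quoted further down in this thread (ids left-padded with '0' to 16 characters, parts joined by "-"). The names PrefixRange, pad16 and stopRowForPrefix are made up for illustration and are not part of any HBase API.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class PrefixRange {
    // Assumed key layout from the quoted thread: ids left-padded with '0' to 16 characters.
    static String pad16(String id) {
        return String.format("%16s", id).replace(' ', '0');
    }

    // Returns the smallest byte string that sorts after every key starting with prefix.
    // An empty result means "no upper bound" (prefix was empty or all 0xFF bytes).
    static byte[] stopRowForPrefix(byte[] prefix) {
        for (int i = prefix.length - 1; i >= 0; i--) {
            if (prefix[i] != (byte) 0xFF) {
                byte[] stop = Arrays.copyOf(prefix, i + 1); // drop trailing 0xFF bytes
                stop[i]++;                                   // bump the last remaining byte
                return stop;
            }
        }
        return new byte[0];
    }

    public static void main(String[] args) {
        // Start row "A": every key for vehicle 57 begins with this prefix.
        byte[] start = (pad16("57") + "-").getBytes(StandardCharsets.US_ASCII);
        // Stop row "A1": the first key that no longer belongs to vehicle 57.
        byte[] stop = stopRowForPrefix(start);
        System.out.println(new String(start, StandardCharsets.US_ASCII));
        System.out.println(new String(stop, StandardCharsets.US_ASCII));
    }
}
```

The two arrays would then go into scan.setStartRow(...) and scan.setStopRow(...) as in the quoted code. The stop row is exclusive, so the scan ends just before the next vehicle_id.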

You would then have to do a server-side filter on the values of C to get the timestamps for a given day.
(You could do this with a client-side filter, but that means pushing all the data over the wire.)
[Note: having said that, you could just do a client-side filter, since you only have 115K rows and the range scan will return only a subset of them.]
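
For the client-side variant, a rough sketch of the day check on C, again in plain Java. It assumes the row keys have already come back from the vehicle range scan as strings, and that the layout is 16 characters, "-", 16 characters, "-", then the 14-character timestamp. ClientSideDayFilter and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

class ClientSideDayFilter {
    // Assumed layout: 16-char vehicle id, "-", 16-char device id, "-", 14-char timestamp.
    static final int TIMESTAMP_OFFSET = 16 + 1 + 16 + 1;
    static final int KEY_LENGTH = TIMESTAMP_OFFSET + 14;

    static String pad16(String id) {
        return String.format("%16s", id).replace(' ', '0');
    }

    // Builds a key in the A-B-C order used in the thread: vehicle, device, timestamp.
    static String key(String vehicle, String device, String timestamp) {
        return pad16(vehicle) + "-" + pad16(device) + "-" + timestamp;
    }

    // True if the timestamp part of the key starts with the given day (YYYYMMDD).
    static boolean isOnDay(String rowKey, String day) {
        return rowKey.length() == KEY_LENGTH && rowKey.startsWith(day, TIMESTAMP_OFFSET);
    }

    // Keeps only the keys for the given day; everything else was fetched for nothing.
    static List<String> keepDay(List<String> rowKeys, String day) {
        List<String> kept = new ArrayList<>();
        for (String rowKey : rowKeys) {
            if (isOnDay(rowKey, day)) {
                kept.add(rowKey);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> fromScan = new ArrayList<>();
        fromScan.add(key("57", "1", "20110807235500"));
        fromScan.add(key("57", "1", "20110808000000"));
        fromScan.add(key("57", "2", "20110808000500"));
        System.out.println(keepDay(fromScan, "20110808"));
    }
}
```

The cost is the one mentioned above: every row for the vehicle crosses the wire before the check runs, which is tolerable only because the table is small.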

The idea of doing something like the following:
SELECT *
FROM TABLE
WHERE A = x
AND DAY(C) = y [or some variation]
{A and C are part of a composite index}

doesn't work in HBase.

If your key were A,C,B, meaning that vehicle_id, timestamp, device_id made up the composite key in that order, then you could do a start/stop range scan using A and C.
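
To illustrate why the A,C,B order helps, here is a small sketch that uses a sorted set as a stand-in for HBase's sorted row keys (for these ASCII keys, Java's String ordering matches byte ordering). The 16-character zero padding and the "-" separator are taken from the quoted thread. AcbKeyDemo, acbKey and scanDay are hypothetical names.

```java
import java.util.NavigableSet;
import java.util.TreeSet;

class AcbKeyDemo {
    static String pad16(String id) {
        return String.format("%16s", id).replace(' ', '0');
    }

    // Key in A-C-B order: vehicle_id, timestamp, device_id.
    static String acbKey(String vehicle, String timestamp, String device) {
        return pad16(vehicle) + "-" + timestamp + "-" + pad16(device);
    }

    // Rows for one vehicle on one day (YYYYMMDD), across all devices: [start, stop).
    static NavigableSet<String> scanDay(NavigableSet<String> sortedKeys, String vehicle, String day) {
        String start = pad16(vehicle) + "-" + day;
        // Bump the last character so that stop sorts after every key beginning with start.
        int last = start.length() - 1;
        String stop = start.substring(0, last) + (char) (start.charAt(last) + 1);
        return sortedKeys.subSet(start, true, stop, false);
    }

    public static void main(String[] args) {
        NavigableSet<String> table = new TreeSet<>();
        table.add(acbKey("57", "20110807235500", "1"));
        table.add(acbKey("57", "20110808000000", "1"));
        table.add(acbKey("57", "20110808120000", "2"));
        table.add(acbKey("57", "20110809000000", "1"));
        table.add(acbKey("58", "20110808000000", "1"));
        // Only the two rows for vehicle 57 on 2011-08-08 fall inside the range.
        for (String rowKey : scanDay(table, "57", "20110808")) {
            System.out.println(rowKey);
        }
    }
}
```

With the A,B,C order, the same day for one vehicle is spread across one block per device, so a single start/stop pair cannot cover it without also picking up other days.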

Sorry if I'm missing something, since I jumped into the middle of a discussion.

-Mike


> Subject: RE: Something like Execution Plan as in the RDBMS world?
> Date: Thu, 4 Aug 2011 12:57:12 +0200
> From: Thomas.Steinmaurer@scch.at
> To: user@hbase.apache.org; apurtell@apache.org
>
> Hi Andy and Ted!
>
> Thanks for your reply. Basically, I'm currently trying a range scan and a regex row filter on a very small table (~ 115K rows), just to get used to. Hadoop/HBase ... is running in the available Cloudera VM.
>
> I have the following row key, as already discussed in other threads.
>
> vehicle_id: up to 16 characters
> device_id: up to 16 characters
> timestamp: YYYYMMDDhhmmss
>
> Pretty much one row every 5 minutes for a particular vehicle and device.
>
> Now I want to get the rows for an entire day for a particular vehicle and device.
>
> The following range scan implementation:
>
> 	Scan scan = new Scan();
>
> 	String startKey =
> 		String.format(HBASE_ROWKEY_DATASOURCEID_FORMAT, "57").replace(' ', '0') // Vehicle ID
> 		+ "-"
> 		+ String.format(HBASE_ROWKEY_DATASOURCEID_FORMAT, "1").replace(' ', '0') // Device ID
> 		+ "-"
> 		+ "20110808000000";
> 	String endKey =
> 		String.format(HBASE_ROWKEY_DATASOURCEID_FORMAT, "57").replace(' ', '0') // Vehicle ID
> 		+ "-"
> 		+ String.format(HBASE_ROWKEY_DATASOURCEID_FORMAT, "1").replace(' ', '0') // Device ID
> 		+ "-"
> 		+ "20110808235959";
> 	scan.setStartRow(Bytes.toBytes(startKey));
> 	scan.setStopRow(Bytes.toBytes(endKey));
> 	scan.addColumn(Bytes.toBytes("data_details"), Bytes.toBytes("temperature1_value"));
>
> Takes < 1 sec.
>
> Whereas the following regex based row filter implementation:
>
> 	List<Filter> filters = new ArrayList<Filter>();
> 	RowFilter rf = new RowFilter(
> 		CompareFilter.CompareOp.EQUAL
> 		, new RegexStringComparator(".{14}57\\-.{15}1\\-20110808.{6}")
> 	);
> 	filters.add(rf);
>
> 	QualifierFilter qf = new QualifierFilter(
> 		CompareFilter.CompareOp.EQUAL
> 		, new RegexStringComparator("temperature1_value")
> 	);
> 	filters.add(qf);
>
> 	FilterList filterList1 = new FilterList(filters);
> 	scan.setFilter(filterList1);
>
>
> Takes around 6 sec on a very small table.
>
>
> We aren't sure if we need the regex row filter capabilities at all or if range scans are sufficient for our access pattern. But a better understanding on how to optimize regex stuff would be helpful.
>
>
> Thanks!
>
> Thomas
>
>
> -----Original Message-----
> From: Andrew Purtell [mailto:apurtell@apache.org]
> Sent: Wednesday, 27 July 2011 08:25
> To: user@hbase.apache.org
> Subject: Re: Something like Execution Plan as in the RDBMS world?
>
> > Or is this a complete different thinking?
>
> Yes.
>
> There isn't an "execution plan" when using HBase, as that term is commonly understood from RDBMS systems. The commands you issue against HBase using the client API are executed in order as you issue them.
>
> > Depending on the access pattern, we might be in a situation to use
> >e.g. RegEx filters on rowkeys. I wonder if there is some kind of an
> >execution plan when running a HBase query to better understand
>
> Exposing filter statistics (hit/skip ratio etc.) and other per-query metrics like number of store files read, how many keys examined, etc. is an interesting idea perhaps along the lines of what you ask, but HBase does not have support for that level of query performance introspection at the moment.
>
> What people do is measure the application metrics of interest and try different approaches to optimize them.
>
> Best regards,
>
>
>    - Andy
>
> Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)
>
>
> >________________________________
> >From: Steinmaurer Thomas <Thomas.Steinmaurer@scch.at>
> >To: user@hbase.apache.org
> >Sent: Tuesday, July 26, 2011 11:10 PM
> >Subject: Something like Execution Plan as in the RDBMS world?
> >
> >Hello,
> >
> >we have a three part row-key taking into account that the first part is
> >important for distribution/partitioning when the system grows.
> >Depending on the access pattern, we might be in a situation to use e.g.
> >RegEx filters on rowkeys. I wonder if there is some kind of an
> >execution plan (as known in RDBMS) when running a HBase query to better
> >understand how HBase processes the query and what execution path it
> >takes to generate the result set.
> >
> >Or is this a complete different thinking?
> >
> >Thanks,
> >
> >Thomas
> >

--_a5ed3440-2c95-43c9-b772-b5abf8c5fac2_--