From: Jonathan Gray <jgray@facebook.com>
To: hbase-user@hadoop.apache.org
Date: Fri, 2 Apr 2010 09:21:09 -0700
Subject: RE: How to do "Group By" in HBase

Row=product:zip:day ?

Basically you can create additional tables with other keys to give
yourselves the aggregates you need. You'll need to decide how many to
make. With the above row key, you could actually get grouping by state
by scanning a range of zips. But if that's not efficient enough, then
you'd need to create product:state:day. Denormalization is the name of
the game (for now).
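Roughly, the zip-range scan could look like this with the Java client
(an untested sketch; the "sales_by_zip" table, "s" family, "num"
qualifier, and the zip bounds are all made up for illustration):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.client.ResultScanner;
  import org.apache.hadoop.hbase.client.Scan;
  import org.apache.hadoop.hbase.util.Bytes;

  // Sum "hammer" sales across California by scanning its contiguous
  // zip range in a table keyed product:zip:day.
  Configuration conf = HBaseConfiguration.create();
  HTable table = new HTable(conf, "sales_by_zip");
  Scan scan = new Scan(Bytes.toBytes("hammer:90000"),   // start (inclusive)
                       Bytes.toBytes("hammer:96200"));  // stop (exclusive)
  scan.addColumn(Bytes.toBytes("s"), Bytes.toBytes("num"));
  ResultScanner scanner = table.getScanner(scan);
  long total = 0;
  for (Result r : scanner) {
    // Day is the last key component, so a date restriction has to be
    // applied client-side -- one reason product:state:day may still
    // be worth materializing.
    String row = Bytes.toString(r.getRow());            // product:zip:day
    String day = row.substring(row.lastIndexOf(':') + 1);
    if (day.compareTo("20100101") >= 0 && day.compareTo("20100131") <= 0) {
      byte[] v = r.getValue(Bytes.toBytes("s"), Bytes.toBytes("num"));
      if (v != null) total += Bytes.toLong(v);  // counters are 8-byte longs
    }
  }
  scanner.close();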
> -----Original Message-----
> From: Sean [mailto:seanatpurdue@hotmail.com]
> Sent: Thursday, April 01, 2010 10:38 PM
> To: hbase-user@hadoop.apache.org
> Subject: RE: How to do "Group By" in HBase
>
> > From: jgray@facebook.com
> > To: hbase-user@hadoop.apache.org
> > Date: Thu, 1 Apr 2010 09:34:17 -0700
> > Subject: RE: How to do "Group By" in HBase
> >
> > For 1/2, it seems that your row key design is ideal for those
> > queries. You say it's inefficient because you need to scan the
> > "whole session of data" containing hammer... but wouldn't you
> > always have to do that unless you were doing some kind of
> > summary/rollups? Even in a relational database you scan lots of
> > stuff!
> >
> > I can't say I fully understand query #3. You select a single
> > product and a single store, but then group by product and store?
> > Is that GROUP BY required if you just want a single SUM() result?
> > Sorry, my SQL is rusty :)
> >
> > If you're really just after counts of stuff, you could potentially
> > keep counters rather than scanning all the original source data.
> > Check out HTable.incrementColumnValue(). You could have a table
> > with similar rows that are rolled up on different dimensions.
> >
> > For example, if you wanted to roll up counts for each product by
> > each day, you could have row=product:day and then a single column,
> > or a column for each store. In that column you would use
> > incrementColumnValue to add each sale's num. Then you could do a
> > single-row query to get all of the counts for that product for
> > that day, by store. If you wanted a range of days, you could scan
> > the days you need, but this would still be more efficient than
> > scanning the actual orders since each cell is just a counter.
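> >
> > Something like this, roughly (untested; the "rollup" table name
> > and "s" family are made up, and you'd also need client.Get and
> > hbase.KeyValue imported):
> >
> >   HTable rollup = new HTable(conf, "rollup");
> >   long num = 3;  // e.g. quantity from the sale being recorded
> >   // Row = product:day, one counter column per store; atomic add.
> >   rollup.incrementColumnValue(Bytes.toBytes("hammer:20100401"),
> >       Bytes.toBytes("s"), Bytes.toBytes("SFO_AIRPORT"), num);
> >
> >   // Later, a single Get returns that product/day's count per store.
> >   Result counts = rollup.get(new Get(Bytes.toBytes("hammer:20100401")));
> >   for (KeyValue kv : counts.raw()) {
> >     System.out.println(Bytes.toString(kv.getQualifier()) + " = "
> >         + Bytes.toLong(kv.getValue()));
> >   }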
>
> Thank you, Jonathan. I see your suggestion here, which basically has
> row=product:day and column=store. That's a great schema for
> answering the above question. However, I have a follow-up question
> (I did not describe the whole picture of my problem): stores belong
> to different zip codes; zip codes belong to different counties;
> counties belong to different states. I want to be able to query each
> zip code's sales over a range of dates, and I want to query each
> state's sales over a range of dates. How can I get these different
> levels of aggregation efficiently? (This seems to fall into the
> OLAP-cube problem -- moreover, my original problem only has 3
> dimensions in its key, but what if my problem has N dimensions where
> N >= 4?)
>
> Thanks,
> Sean
>
> > Make sense?
> >
> > JG
> >
> > > -----Original Message-----
> > > From: Sean [mailto:seanatpurdue@hotmail.com]
> > > Sent: Thursday, April 01, 2010 1:46 AM
> > > To: hbase-user@hadoop.apache.org
> > > Subject: How to do "Group By" in HBase
> > >
> > > I have the following kind of data (a typical store sales
> > > record):
> > >
> > > {product, date, store_name} --> number
> > >
> > > I understand that if I choose the following row key design, I
> > > will be able to quickly GROUP BY store_name.
> > >
> > > row key -- product:date:store_name
> > > column name -- number
> > >
> > > In other words, I can efficiently achieve the following logic
> > > with just an HBase scan -- an adjacent scan.
> > >
> > > 1) SELECT SUM(num) FROM sale_history_table WHERE
> > > product="hammer" GROUP BY product
> > >
> > > 2) SELECT SUM(num) FROM sale_history_table WHERE
> > > product="hammer" AND date="12/04/2009" GROUP BY product, date
> > >
> > > However, it's very inefficient to do the following, because to
> > > achieve it I basically need to scan the whole session of data
> > > containing "hammer":
> > >
> > > 3) SELECT SUM(num) FROM sale_history_table WHERE
> > > product="hammer" AND store_name="SFO_AIRPORT" GROUP BY product,
> > > store_name
> > >
> > > Can someone give me advice on how I should design my HBase
> > > schema if I choose to use native HBase (I am thinking a second
> > > table may help case 3, but I have not come up with an idea)?
> > >
> > > (I understand Zohmg is good at this kind of problem, but I'd
> > > rather keep it as a last resort.)
> > >
> > > Thanks,
> > > Sean
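P.S. The "adjacent scan" behind your #1/#2 is just a start/stop-row
scan over the product:date:store_name keys. Another rough, untested
sketch (same imports and conf as above; assumes a made-up "d" family
holding the "number" column, with values written as Bytes.toBytes(long)):

  // Every store's "hammer" row for 12/04/2009. ';' is the byte right
  // after ':', so it caps the key range at exactly this prefix.
  Scan day = new Scan(Bytes.toBytes("hammer:12/04/2009:"),
                      Bytes.toBytes("hammer:12/04/2009;"));
  ResultScanner rs = new HTable(conf, "sale_history_table").getScanner(day);
  long sum = 0;
  for (Result r : rs) {
    byte[] v = r.getValue(Bytes.toBytes("d"), Bytes.toBytes("number"));
    if (v != null) sum += Bytes.toLong(v);
  }
  rs.close();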