hbase-user mailing list archives

From Jonathan Gray <jg...@facebook.com>
Subject RE: How to do "Group By" in HBase
Date Fri, 02 Apr 2010 16:21:09 GMT
Row=product:zip:day ?

Basically you can create additional tables with other keys to give yourself the aggregates
you need.  You'll need to decide how many to make.  With the above row key, you could actually
get grouping by state by scanning a range of zips.  But if that's not efficient enough, then
you'd need to create product:state:day.  Denormalization is the name of the game (for now).
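
To make that concrete, here is a rough sketch of the zip-range scan against a
hypothetical rollup table keyed product:zip:day, using the old HTable client
API. The table name, column family, day format, and the idea that a state's
zips form one contiguous numeric range are all assumptions for illustration,
not details from the thread.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class StateRollupFromZips {
  public static void main(String[] args) throws Exception {
    // Hypothetical rollup table keyed product:zip:day; cells are 8-byte long counters.
    HTable table = new HTable(HBaseConfiguration.create(), "sales_by_product_zip_day");

    // Rows sort lexicographically, so one scan covers a whole zip range.
    // (Treating zips 94000-94999 as "the state" is an illustrative simplification.)
    Scan scan = new Scan(Bytes.toBytes("hammer:94000:"),
                         Bytes.toBytes("hammer:95000:"));
    scan.addColumn(Bytes.toBytes("sales"), Bytes.toBytes("num"));

    long total = 0;
    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result r : scanner) {
        // Row is product:zip:day; keep only the days we care about.
        String day = Bytes.toString(r.getRow()).split(":")[2];
        if (day.compareTo("20100401") < 0 || day.compareTo("20100430") > 0) {
          continue;
        }
        byte[] value = r.getValue(Bytes.toBytes("sales"), Bytes.toBytes("num"));
        if (value != null) {
          total += Bytes.toLong(value);
        }
      }
    } finally {
      scanner.close();
      table.close();
    }
    System.out.println("hammer, zips 94000-94999, April 2010: " + total);
  }
}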

> -----Original Message-----
> From: Sean [mailto:seanatpurdue@hotmail.com]
> Sent: Thursday, April 01, 2010 10:38 PM
> To: hbase-user@hadoop.apache.org
> Subject: RE: How to do "Group By" in HBase
> 
> > From: jgray@facebook.com
> > To: hbase-user@hadoop.apache.org
> > Date: Thu, 1 Apr 2010 09:34:17 -0700
> > Subject: RE: How to do "Group By" in HBase
> >
> > For 1/2, it seems that your row key design is ideal for those
> > queries.  You say it's inefficient because you need to scan the
> > "whole session of data" containing hammer... but wouldn't you always
> > have to do that unless you were doing some kind of summary/rollups?
> > Even in a relational database you scan lots of stuff!
> >
> > I can't say I fully understand query #3.  You select a single
> > product and single store, but then group by product and store?  Is
> > that group by required if you just want a single SUM() result?
> > Sorry my SQL is rusty :)
> >
> > If you're really just after counts of stuff, you could potentially
> > keep counters rather than scanning all the original source data.
> > Check out HTable.incrementColumnValue().  You could have a table
> > with similar rows but that were rolled up on different dimensions.
> >
> > For example, if you wanted to roll up counts for each product by
> > each day, you could have row=product:day and then a single column,
> > or a column for each store.  In that column you would use
> > incrementColumnValue to increment by each sale's num.  Then you
> > could do a single-row query to get all of the counts for that
> > product for that day, by store.  If you wanted a range of days, you
> > could scan the days you need, but this would be more efficient than
> > scanning the actual orders since each cell is just a counter.
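
A minimal sketch of that counter pattern, assuming a hypothetical table keyed
product:day with one counter column per store in a made-up "s" family; only
incrementColumnValue() itself comes from the message above.

import java.util.Map;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class ProductDayCounters {
  static final byte[] FAMILY = Bytes.toBytes("s");  // hypothetical counter family

  // Called once per order as it arrives; HBase applies the increment atomically.
  static void recordSale(HTable table, String product, String day,
                         String store, long num) throws Exception {
    table.incrementColumnValue(Bytes.toBytes(product + ":" + day),
                               FAMILY, Bytes.toBytes(store), num);
  }

  // A single-row Get returns every store's running count for that product/day.
  static void printCountsByStore(HTable table, String product, String day) throws Exception {
    Get get = new Get(Bytes.toBytes(product + ":" + day));
    get.addFamily(FAMILY);
    Result r = table.get(get);
    Map<byte[], byte[]> counts = r.getFamilyMap(FAMILY);
    if (counts == null) {
      return;  // no counters recorded yet for this product/day
    }
    for (Map.Entry<byte[], byte[]> e : counts.entrySet()) {
      System.out.println(Bytes.toString(e.getKey()) + " = " + Bytes.toLong(e.getValue()));
    }
  }

  public static void main(String[] args) throws Exception {
    HTable table = new HTable(HBaseConfiguration.create(), "sales_by_product_day");
    recordSale(table, "hammer", "20100401", "SFO_AIRPORT", 3);
    printCountsByStore(table, "hammer", "20100401");
    table.close();
  }
}

Reading one narrow counter row per product/day is what makes this cheaper than
re-scanning all of the raw order rows.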
> 
> Thank you, Jonathan. I see your suggestion here, which basically has
> row=product:day and column=store. That's a great schema for answering
> the above question. However, I have a follow-up question (I did not
> describe the whole picture of my problem):
> Stores belong to different zip codes; zip codes belong to different
> counties; counties belong to different states. I want to be able to
> query each zip code's sales over a range of dates, and I want to query
> each state's sales over a range of dates. How can I get different
> levels of aggregation efficiently? (This seems to fall into the OLAP
> cube problem -- moreover, my original problem only holds 3 dimensions
> in its key, but what if my problem holds N dimensions where N >= 4?)
> Thanks, Sean
> 
> 
> > Make sense?
> >
> > JG
> >
> > > -----Original Message-----
> > > From: Sean [mailto:seanatpurdue@hotmail.com]
> > > Sent: Thursday, April 01, 2010 1:46 AM
> > > To: hbase-user@hadoop.apache.org
> > > Subject: How to do "Group By" in HBase
> > >
> > > I have the following kind of data (a typical store sales record):
> > > {product, date, store_name} --> number
> > >
> > > I understand that if I choose the following row key design, I
> > > will be able to quickly GROUP BY store_name.
> > >
> > > row key -- product:date:store_name
> > > column name -- number
> > >
> > > In other words, I can efficiently achieve the following logic
> > > (just an HBase scan) -- an adjacent scan.
> > > 1) SELECT SUM(num) FROM sale_history_table WHERE product='hammer'
> > >    GROUP BY product
> > > 2) SELECT SUM(num) FROM sale_history_table WHERE product='hammer'
> > >    AND date='12/04/2009' GROUP BY product, date
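
As a rough illustration (not from the original message), query 2 as an
adjacent scan over that key: all rows for one product and date are contiguous,
so a single scan can also sum num per store. The column family, value
encoding, and table handle used below are assumed.

import java.util.TreeMap;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class GroupByStoreScan {
  public static void main(String[] args) throws Exception {
    HTable table = new HTable(HBaseConfiguration.create(), "sale_history_table");

    // Prefix "hammer:12/04/2009:"; the stop row bumps the trailing ':' to ';'
    // (the next byte), so the scan ends right after the last store for that date.
    Scan scan = new Scan(Bytes.toBytes("hammer:12/04/2009:"),
                         Bytes.toBytes("hammer:12/04/2009;"));
    scan.addColumn(Bytes.toBytes("d"), Bytes.toBytes("number"));

    TreeMap<String, Long> sumByStore = new TreeMap<String, Long>();
    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result r : scanner) {
        // Row is product:date:store_name; the store is the third segment.
        String store = Bytes.toString(r.getRow()).split(":")[2];
        byte[] value = r.getValue(Bytes.toBytes("d"), Bytes.toBytes("number"));
        if (value == null) {
          continue;
        }
        long num = Bytes.toLong(value);  // assumes num is stored as an 8-byte long
        Long prev = sumByStore.get(store);
        sumByStore.put(store, (prev == null ? 0L : prev) + num);
      }
    } finally {
      scanner.close();
      table.close();
    }
    System.out.println(sumByStore);  // SUM(num) per store for hammer on 12/04/2009
  }
}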
> > >
> > > However, it's very inefficient to do the following, because to
> > > achieve it I basically need to scan the whole section of data
> > > containing "hammer":
> > >
> > > 3) SELECT SUM(num) FROM sale_history_table WHERE product='hammer'
> > >    AND store_name='SFO_AIRPORT' GROUP BY product, store_name
> > >
> > > Can someone give me advice on how I should design my HBase schema
> > > if I choose to use native HBase? (I am thinking a second table may
> > > help case 3, but I have not come up with an idea.)
> > >
> > >
> > >
> > > (I understand Zohmg is good at this kind of problem, but I'd
> > > rather keep it as a last resort.)
> > >
> > > Thanks,
> > > Sean
> > >
> 
