hbase-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Sean <seanatpur...@hotmail.com>
Subject RE: How to do "Group By" in HBase
Date Fri, 02 Apr 2010 05:38:00 GMT



> From: jgray@facebook.com
> To: hbase-user@hadoop.apache.org
> Date: Thu, 1 Apr 2010 09:34:17 -0700
> Subject: RE: How to do "Group By" in HBase
> 
> For 1/2, it seems that your row key design is ideal for those queries.  You say it's
inefficient because you need to scan the "whole session of data" containing hammer... but
wouldn't you always have to do that unless you were doing some kind of summary/rollups?  Even
in a relational database you scan lots of stuff!
> 
> I can't say I fully understand query #3.  You select a single product and single store,
but then group by product and store?  Is that group by required if you just want a single
SUM() result?  Sorry my SQL is rusty :)
> 
> If you're really just after counts of stuff, you could potentially keep counters rather
than scanning all the original source data.  Check out HTable.incrementColumnValue().  You
could have a table with similar rows but that were rolled up on different dimensions.
> 
> For example, if you wanted to roll-up counts for each product by each day, you could
have row=product:day and then could have a single column or column for each store.  In that
column you would use incrementColumnValue to increment for each num().  Then you could do
a single row query to get all of the counts for that product for that day, by store.  If you
wanted for a range of days, you could scan the days you need, but this would be more efficient
than scanning the actual orders since each cell is just a counter.

Thank you, Jonathan. I see your suggestion here, which basically have row=product:day and
column=store. That's a great schema answering the above question. However, I have a question
following that (I did not describe the whole picture my problem): 
Stores belong to different zip codes; zip codes belong to different counties; counties belong
to different states.	-- I want to be able to query each zip code's sales in a range of dates,
and I want to query each state's sales in a range of dates. How can I get different level
of aggregation efficiently? (This seems to fall into OLAP cube problem -- moreover, my original
problem only holds 3-dimension in its key, but what if my problem holds N-dimension where
N>=4?)
Thanks,Sean


> Make sense?
> 
> JG
> 
> > -----Original Message-----
> > From: Sean [mailto:seanatpurdue@hotmail.com]
> > Sent: Thursday, April 01, 2010 1:46 AM
> > To: hbase-user@hadoop.apache.org
> > Subject: How to do "Group By" in HBase
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > I have the follow kind of data (a typical store sell record):
> > {product, date, store_name} --> number
> > 
> > I understand that if I choose the following row key design, I will be
> > able to quickly GROUP BY store_name.
> > 
> > row key -- product:date:store_name
> > column name -- number
> > 
> > In other words, I can efficiently achieve the following logic (just a
> > HBase scan) -- adjacent scan.
> > 1) SELECT SUM(num) FROM sale_history_table where product="hammer" GROUP
> > BY product
> > 2) SELECT SUM(num) FROM sale_history_table where product="hammer",
> > date="12/04/2009" GROUP BY product date
> > 
> > However, it's very inefficient to do the following thing because to
> > achieve this, I basically need to scan the whole session of data that
> > containing "hammer"
> > 
> > 3) SELECT SUM(num) FROM sale_history_table where product="hammer",
> > store_name="SFO_AIRPORT" GROUP BY product store_name
> > 
> > Can someone give me an advice on what I should design my HBase schema
> > if I choose to use native Hbase (I am thinking a second table may help
> > case 3, but have not come up with an idea)?
> > 
> > 
> > 
> > ( I understand Zohmg is good at these kind of problem, but I'd rather
> > choose it as the last resort)
> > 
> > Thanks,
> > Sean
> > 
> > The New Busy is not the old busy. Search, chat and e-mail from your
> > inbox. Get started.
> > _________________________________________________________________
> > Hotmail is redefining busy with tools for the New Busy. Get more from
> > your inbox.
> > http://www.windowslive.com/campaign/thenewbusy?ocid=PID27925::T:WLMTAGL
> > :ON:WL:en-US:WM_HMP:032010_2
 		 	   		  
_________________________________________________________________
The New Busy think 9 to 5 is a cute idea. Combine multiple calendars with Hotmail. 
http://www.windowslive.com/campaign/thenewbusy?tile=multicalendar&ocid=PID28326::T:WLMTAGL:ON:WL:en-US:WM_HMP:042010_5
Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message