From: Jonathan Gray <jgray@facebook.com>
To: hbase-user@hadoop.apache.org
Date: Fri, 2 Apr 2010 09:21:09 -0700
Subject: RE: How to do "Group By" in HBase

Row=product:zip:day ?

Basically you can create additional tables with other keys to give
yourselves the aggregates you need. You'll need to decide how many to
make. With the above row key, you could actually get grouping by state
by scanning a range of zips. But if that's not efficient enough, then
you'd need to create product:state:day. Denormalization is the name of
the game (for now).
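Roughly, the zip-range scan could look like this with the Java client
(an untested sketch; the "sales_by_zip" table, "s" family, "num"
qualifier, and the zip bounds are all made up for illustration):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.client.ResultScanner;
  import org.apache.hadoop.hbase.client.Scan;
  import org.apache.hadoop.hbase.util.Bytes;

  // Sum "hammer" sales across California by scanning its contiguous
  // zip range in a table keyed product:zip:day.
  Configuration conf = HBaseConfiguration.create();
  HTable table = new HTable(conf, "sales_by_zip");
  Scan scan = new Scan(Bytes.toBytes("hammer:90000"),   // start (inclusive)
                       Bytes.toBytes("hammer:96200"));  // stop (exclusive)
  scan.addColumn(Bytes.toBytes("s"), Bytes.toBytes("num"));
  ResultScanner scanner = table.getScanner(scan);
  long total = 0;
  for (Result r : scanner) {
    // Day is the last key component, so a date restriction has to be
    // applied client-side -- one reason product:state:day may still
    // be worth materializing.
    String row = Bytes.toString(r.getRow());            // product:zip:day
    String day = row.substring(row.lastIndexOf(':') + 1);
    if (day.compareTo("20100101") >= 0 && day.compareTo("20100131") <= 0) {
      byte[] v = r.getValue(Bytes.toBytes("s"), Bytes.toBytes("num"));
      if (v != null) total += Bytes.toLong(v);  // counters are 8-byte longs
    }
  }
  scanner.close();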
> -----Original Message-----
> From: Sean [mailto:seanatpurdue@hotmail.com]
> Sent: Thursday, April 01, 2010 10:38 PM
> To: hbase-user@hadoop.apache.org
> Subject: RE: How to do "Group By" in HBase
>
> > From: jgray@facebook.com
> > To: hbase-user@hadoop.apache.org
> > Date: Thu, 1 Apr 2010 09:34:17 -0700
> > Subject: RE: How to do "Group By" in HBase
> >
> > For 1/2, it seems that your row key design is ideal for those
> > queries. You say it's inefficient because you need to scan the
> > "whole session of data" containing hammer... but wouldn't you
> > always have to do that unless you were doing some kind of
> > summary/rollups? Even in a relational database you scan lots of
> > stuff!
> >
> > I can't say I fully understand query #3. You select a single
> > product and a single store, but then group by product and store?
> > Is that GROUP BY required if you just want a single SUM() result?
> > Sorry, my SQL is rusty :)
> >
> > If you're really just after counts of stuff, you could potentially
> > keep counters rather than scanning all the original source data.
> > Check out HTable.incrementColumnValue(). You could have a table
> > with similar rows that are rolled up on different dimensions.
> >
> > For example, if you wanted to roll up counts for each product by
> > each day, you could have row=product:day and then a single column,
> > or a column for each store. In that column you would use
> > incrementColumnValue to add each sale's num. Then you could do a
> > single-row query to get all of the counts for that product for
> > that day, by store. If you wanted a range of days, you could scan
> > the days you need, but this would still be more efficient than
> > scanning the actual orders since each cell is just a counter.
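> >
> > Something like this, roughly (untested; the "rollup" table name
> > and "s" family are made up, and you'd also need client.Get and
> > hbase.KeyValue imported):
> >
> >   HTable rollup = new HTable(conf, "rollup");
> >   long num = 3;  // e.g. quantity from the sale being recorded
> >   // Row = product:day, one counter column per store; atomic add.
> >   rollup.incrementColumnValue(Bytes.toBytes("hammer:20100401"),
> >       Bytes.toBytes("s"), Bytes.toBytes("SFO_AIRPORT"), num);
> >
> >   // Later, a single Get returns that product/day's count per store.
> >   Result counts = rollup.get(new Get(Bytes.toBytes("hammer:20100401")));
> >   for (KeyValue kv : counts.raw()) {
> >     System.out.println(Bytes.toString(kv.getQualifier()) + " = "
> >         + Bytes.toLong(kv.getValue()));
> >   }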
>
> Thank you, Jonathan. I see your suggestion here, which basically has
> row=product:day and column=store. That's a great schema for
> answering the above question. However, I have a follow-up question
> (I did not describe the whole picture of my problem): stores belong
> to different zip codes; zip codes belong to different counties;
> counties belong to different states. I want to be able to query each
> zip code's sales over a range of dates, and I want to query each
> state's sales over a range of dates. How can I get these different
> levels of aggregation efficiently? (This seems to fall into the
> OLAP-cube problem -- moreover, my original problem only has 3
> dimensions in its key, but what if my problem has N dimensions where
> N >= 4?)
>
> Thanks,
> Sean
>
> > Make sense?
> >
> > JG
> >
> > > -----Original Message-----
> > > From: Sean [mailto:seanatpurdue@hotmail.com]
> > > Sent: Thursday, April 01, 2010 1:46 AM
> > > To: hbase-user@hadoop.apache.org
> > > Subject: How to do "Group By" in HBase
> > >
> > > I have the following kind of data (a typical store sales
> > > record):
> > >
> > > {product, date, store_name} --> number
> > >
> > > I understand that if I choose the following row key design, I
> > > will be able to quickly GROUP BY store_name.
> > >
> > > row key -- product:date:store_name
> > > column name -- number
> > >
> > > In other words, I can efficiently achieve the following logic
> > > with just an HBase scan -- an adjacent scan.
> > >
> > > 1) SELECT SUM(num) FROM sale_history_table WHERE
> > > product="hammer" GROUP BY product
> > >
> > > 2) SELECT SUM(num) FROM sale_history_table WHERE
> > > product="hammer" AND date="12/04/2009" GROUP BY product, date
> > >
> > > However, it's very inefficient to do the following, because to
> > > achieve it I basically need to scan the whole session of data
> > > containing "hammer":
> > >
> > > 3) SELECT SUM(num) FROM sale_history_table WHERE
> > > product="hammer" AND store_name="SFO_AIRPORT" GROUP BY product,
> > > store_name
> > >
> > > Can someone give me advice on how I should design my HBase
> > > schema if I choose to use native HBase (I am thinking a second
> > > table may help case 3, but I have not come up with an idea)?
> > >
> > > (I understand Zohmg is good at this kind of problem, but I'd
> > > rather keep it as a last resort.)
> > >
> > > Thanks,
> > > Sean
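P.S. The "adjacent scan" behind your #1/#2 is just a start/stop-row
scan over the product:date:store_name keys. Another rough, untested
sketch (same imports and conf as above; assumes a made-up "d" family
holding the "number" column, with values written as Bytes.toBytes(long)):

  // Every store's "hammer" row for 12/04/2009. ';' is the byte right
  // after ':', so it caps the key range at exactly this prefix.
  Scan day = new Scan(Bytes.toBytes("hammer:12/04/2009:"),
                      Bytes.toBytes("hammer:12/04/2009;"));
  ResultScanner rs = new HTable(conf, "sale_history_table").getScanner(day);
  long sum = 0;
  for (Result r : rs) {
    byte[] v = r.getValue(Bytes.toBytes("d"), Bytes.toBytes("number"));
    if (v != null) sum += Bytes.toLong(v);
  }
  rs.close();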