From: Something Something
To: hbase-user@hadoop.apache.org
Subject: Re: HBase table design question
Date: Tue, 27 Oct 2009 10:17:54 -0700 (PDT)
Message-ID: <207004.8591.qm@web111406.mail.gq1.yahoo.com>
In-Reply-To: <749605.30431.qm@web111416.mail.gq1.yahoo.com>
References: <138128.60130.qm@web111413.mail.gq1.yahoo.com> <4ADF4420.60405@streamy.com> <749605.30431.qm@web111416.mail.gq1.yahoo.com>
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="0-266607103-1256663874=:8591"

--0-266607103-1256663874=:8591
Content-Type: text/plain; charset=us-ascii

No responses to this question :( Is my question that stupid, I wonder!

________________________________
From: Something Something
To: hbase-user@hadoop.apache.org
Sent: Wed, October 21, 2009 12:16:19 PM
Subject: Re: HBase table design question

Thanks, Jonathan for the reply. One quick question... So in the User table when I perform the put operation:

    .put("visited", "pageId", 100);
    .put("visited", "pageId", 200);

The 100 gets overwritten with 200. Correct? So should I use... something like this...

    .put("visited", "pageId100", 100);
    .put("visited", "pageId200", 200);

I guess, I am still missing something... sorry.. Please explain. Thanks.

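The two put patterns in the question can be compared with a small in-memory model. This is only a sketch in plain Python, not the HBase client API: `ToyRow`, its methods, and the fake integer timestamps are hypothetical names made up for illustration. It assumes the usual HBase behaviour that a second put to the same family and qualifier adds a newer version of the cell rather than destroying the old one, and that a plain read returns only the newest version. In real HBase the number of versions kept is a column-family setting, so older values are eventually dropped.

```python
import itertools


class ToyRow:
    """In-memory stand-in for one row: (family, qualifier) -> list of versions."""

    def __init__(self):
        self._cells = {}                  # (family, qualifier) -> [(ts, value), ...]
        self._clock = itertools.count(1)  # fake, strictly increasing timestamps

    def put(self, family, qualifier, value):
        """Append a new timestamped version of the cell."""
        ts = next(self._clock)
        self._cells.setdefault((family, qualifier), []).append((ts, value))

    def get(self, family, qualifier, max_versions=1):
        """Return up to max_versions values of one cell, newest first."""
        versions = sorted(self._cells.get((family, qualifier), []), reverse=True)
        return [value for _, value in versions[:max_versions]]

    def qualifiers(self, family):
        """Return all qualifiers present in one family, sorted."""
        return sorted(q for (f, q) in self._cells if f == family)


# Pattern 1: same qualifier twice. The newer value shadows the older one.
same = ToyRow()
same.put("visited", "pageId", 100)
same.put("visited", "pageId", 200)
print(same.get("visited", "pageId"))                  # newest version only
print(same.get("visited", "pageId", max_versions=3))  # version history, newest first

# Pattern 2: one qualifier per page. Both cells coexist side by side.
distinct = ToyRow()
distinct.put("visited", "pageId100", 100)
distinct.put("visited", "pageId200", 200)
print(distinct.qualifiers("visited"))
```

In this model the first pattern hides 100 behind 200 on a default read, while the second keeps one independent cell per page, which is closer to what the "which pages did this user visit" query needs.
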
________________________________
From: Jonathan Gray
To: hbase-user@hadoop.apache.org
Sent: Wed, October 21, 2009 10:25:52 AM
Subject: Re: HBase table design question

You're generally on the right track.

In many cases, rather than using secondary indexes as you would in the relational world, you would have multiple tables in HBase with different keys. You may not need a table for each query, but that depends on your performance requirements and the specific details of the data patterns (how sparse or dense certain things will be).

I would start with a User table and a WebPage table, keyed by their ids. The User table could have a Visited family. The WebPage table could have a VisitedBy family.

Your queries could be run like this:

1) Get(table=User, row=userid, family=Visited, qualifier=WebPageID)

There are a couple of different ways you could model the data here. You could either put in a new version of the same qualifier for each visit, or you could make the qualifier a composite key like WebPageID+VisitStamp, so that visits to the same page would be grouped together.

2) Get(table=User, row=userid, family=Visited)

All qualifiers would represent all pages visited.

3) Get(table=WebPage, row=pageid, family=VisitedBy)

All qualifiers would represent all users who visited. You could store multiple visits by the same user in different ways, as above.

As for using Hive to run these queries, that is not something I would recommend. For one, Hive integration with HBase is not complete (as far as I know). Second, Hive's emphasis is on batch/offline MapReduce jobs.

Running the above 3 queries can be done with the HBase API directly, and efficiently. There's no need for SQL or anything like it.

Hope that helps.

JG

Something Something wrote:
> Hello,
>
> Trying to figure out what's the recommended way of designing tables under HBase. Let's say I need a table to gather statistics regarding user's visits to different web pages.
>
> In the relational database world, we could have a table with following columns:
>
> Primary Key (system generated)
> UserId (foreign key)
> WebPageId (foreign key)
> VisitedDateTime & so on....
>
> Basically, this table would allow us to answer (amongst many others) the following questions...
>
> 1) How many times a User visited a certain Page?
> 2) Which web pages did a particular user visit?
> 3) Which users visited a particular web page? etc etc.
>
> What's the best way to model this in HTable?
> Since every HTable is really a distributed hashmap, does that mean I need to create 3 different HTables (HashMaps) to answer these 3 questions?
>
> 1) One table with (UserId + WebPageId) as the compound key? (To answer #1)
> 2) One table with UserId as the key? (To answer #2)
> 3) One table with WebPageId as the key? (To answer #3)
>
> Along with HTable should I use Hive to run queries such as #1 above?
> Any help in this regard will be greatly appreciated. Thanks.

--0-266607103-1256663874=:8591--
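
The two-table layout suggested in the reply (a User table with a Visited family and a WebPage table with a VisitedBy family, using composite qualifiers such as WebPageID+VisitStamp) can be sketched as a toy in-memory model. This is plain Python with nested dicts standing in for tables, not HBase client code. The function names, the `|` separator, and the empty cell values are assumptions made for illustration, and ids are assumed never to contain the separator.

```python
# Each "table" maps row key -> family -> {qualifier: value}.
user_table = {}
webpage_table = {}

SEP = "|"  # hypothetical separator between the id and the visit stamp


def _family(table, row, family):
    """Return the qualifier map for one row and family (empty if absent)."""
    return table.get(row, {}).get(family, {})


def record_visit(user_id, page_id, stamp):
    """Write one visit to both tables, using composite qualifiers."""
    visited = user_table.setdefault(user_id, {}).setdefault("Visited", {})
    visited[f"{page_id}{SEP}{stamp}"] = b""
    visited_by = webpage_table.setdefault(page_id, {}).setdefault("VisitedBy", {})
    visited_by[f"{user_id}{SEP}{stamp}"] = b""


def count_visits(user_id, page_id):
    """Query 1: how many times did this user visit this page?"""
    prefix = f"{page_id}{SEP}"  # including SEP keeps "p1" from matching "p10"
    return sum(q.startswith(prefix) for q in _family(user_table, user_id, "Visited"))


def pages_visited(user_id):
    """Query 2: which pages did this user visit?"""
    quals = _family(user_table, user_id, "Visited")
    return sorted({q.split(SEP, 1)[0] for q in quals})


def users_who_visited(page_id):
    """Query 3: which users visited this page?"""
    quals = _family(webpage_table, page_id, "VisitedBy")
    return sorted({q.split(SEP, 1)[0] for q in quals})


record_visit("u1", "p1", 1000)
record_visit("u1", "p1", 2000)
record_visit("u1", "p2", 1500)
record_visit("u2", "p1", 1700)

print(count_visits("u1", "p1"))
print(pages_visited("u1"))
print(users_who_visited("p1"))
```

Each query reads a single row from a single table, which is why two tables are enough for all three questions. The price is that every visit is written twice, and in a real deployment those two writes would go to different tables and would not be applied atomically.
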