From: Ian Varley <ivarley@salesforce.com>
To: "user@hbase.apache.org"
Date: Thu, 8 Nov 2012 05:46:19 -0800
Subject: Re: Nosqls schema design

Hi Nick,

The key question to ask about this use case is the access pattern. Do you need real-time access to new information as it is created? (I.e., if someone reads an article, do your queries need to reflect that immediately?) If not, and a batch approach is fine (say, nightly processing), then Hadoop is a good first step: it can map/reduce over the logs and aggregate/index the data, either producing static reports like the ones you describe below for all the users & pages, or inserting the results into a database.

If, on the other hand, you need real-time ingest & reporting, that's where HBase would be a better fit. You could write the code so that every "log" you write is actually passed to an API that inserts the data into one or more HBase tables in real time.

The trick is that since HBase doesn't have built-in secondary indexing, you'd have to write the data in ways that can provide low-latency responses to the queries below. That likely means denormalizing the data: in your case, probably one table keyed by page & time ("all users that have been to the webpage X in the last N days"), another keyed by user & time ("all the pages seen by a given user"), and so on. In other words, every action would result in multiple writes into HBase. (This is the same thing that happens in a relational database--an "index" means you're writing the data in two places--but there, it's hidden from you and maintained transparently.) Once you get that working, though, you can have very fast access to real-time data for any of those queries.
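To make that concrete, here's a minimal sketch of what that multi-table write path could look like with the plain Java client API (of the HBase 0.94 era). The table names ("events_by_page", "events_by_user"), the single column family "f", the delimiter-based row keys, and the reverse-timestamp trick are all illustrative assumptions, not a prescribed schema:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class EventWriter {
        private static final byte[] FAMILY = Bytes.toBytes("f");
        private final HTable byPage;  // hypothetical table "events_by_page"
        private final HTable byUser;  // hypothetical table "events_by_user"

        public EventWriter(Configuration conf) throws IOException {
            this.byPage = new HTable(conf, "events_by_page");
            this.byUser = new HTable(conf, "events_by_user");
        }

        /** One logical event fans out into one write per access pattern. */
        public void logEvent(String user, String page, String action, long ts)
                throws IOException {
            // Reverse the timestamp so newer events sort first, and zero-pad
            // it so the string ordering matches the numeric ordering.
            String revTs = String.format("%019d", Long.MAX_VALUE - ts);

            // Keyed by page & time: "all users on page X in the last N days".
            Put p = new Put(Bytes.toBytes(page + "/" + revTs + "/" + user));
            p.add(FAMILY, Bytes.toBytes("action"), Bytes.toBytes(action));
            byPage.put(p);

            // Keyed by user & time: "all pages seen by a given user".
            Put u = new Put(Bytes.toBytes(user + "/" + revTs + "/" + page));
            u.add(FAMILY, Bytes.toBytes("action"), Bytes.toBytes(action));
            byUser.put(u);
        }
    }

The pattern is: every query you want to serve cheaply gets its own table (or key layout), and the write path pays for that up front.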
Note: that's a lot of work. If you don't really need real-time ingest and query, Hadoop queries over the logs are much simpler (especially with tools like Hive, where you can write real SQL statements and have them automatically translated into Hadoop map/reduce jobs).

Also, 10 TB isn't outside the range that a traditional database can handle (given the right hardware, schema design & indexing). You may find it simpler to model your problem that way, either using Hadoop as a bridge between the raw log data and the database (if offline is OK) or inserting directly. The key benefit of going the Hadoop/HBase route is horizontal scalability: even if you don't know your eventual size target, you can be confident that you can scale linearly by adding hardware. That's critical if you're Google or Facebook, but not as frequently required for smaller businesses. Don't over-engineer ... :)

Ian

On Nov 8, 2012, at 3:00 AM, Nick maillard wrote:

Hi everyone

I'm currently testing HBase/Hadoop in terms of performance but also in terms of applicability. After some tries and reads, I'm wondering if HBase is well suited to the need I'm testing.

Say I had logs from websites recording users going to a webpage, reading an article, liking a piece of data, commenting, or bookmarking. I would store these logs over a long period and for a lot of different websites, and I would like to query the data with these questions:

- All users that have been to the webpage X in the last N days.
- All users that have liked and then bookmarked a page in a range of Y days.
- All the pages that are commented X times in the last N days.
- All users that have commented a page W and liked a page P.
- All pages seen, liked or commented by a given user.

As you see, this might be a very SQL way of thinking. Since the questions are different in nature, as I understand it I would have different tables to answer them. Am I correct? How could this be represented, and would SQL be a better fit?

The data would be large, around 10 TB.

regards
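For concreteness, the first query in that list ("all users that have been to the webpage X in the last N days") then becomes a plain row-key range scan over the page-keyed table from the sketch above; the key layout and table name are the same assumptions as before:

    import java.io.IOException;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class UsersOnPage {
        /** Prints users seen on `page` in the last `days` days, newest first. */
        public static void printUsers(String page, int days) throws IOException {
            long now = System.currentTimeMillis();
            long cutoff = now - days * 24L * 60 * 60 * 1000;

            // Rows are "page/reverse-timestamp/user", so newer events carry
            // smaller reverse timestamps and sort ahead of older ones.
            String start = page + "/" + String.format("%019d", Long.MAX_VALUE - now);
            String stop  = page + "/" + String.format("%019d", Long.MAX_VALUE - cutoff);

            HTable byPage = new HTable(HBaseConfiguration.create(), "events_by_page");
            try {
                ResultScanner scanner = byPage.getScanner(
                        new Scan(Bytes.toBytes(start), Bytes.toBytes(stop)));
                for (Result r : scanner) {
                    String[] parts = Bytes.toString(r.getRow()).split("/");
                    System.out.println(parts[parts.length - 1]);  // the user id
                }
                scanner.close();
            } finally {
                byPage.close();
            }
        }
    }

Queries that don't match a precomputed key layout (e.g. "liked and then bookmarked within Y days") would need yet another purpose-built table or an offline map/reduce job--which is exactly the trade-off described above.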