From: Ian Varley <ivarley@salesforce.com>
To: "user@hbase.apache.org"
Date: Thu, 8 Nov 2012 05:46:19 -0800
Subject: Re: Nosqls schema design

Hi Nick,

The key question to ask about this use case is the access pattern. Do you need real-time access to new information as it is created? (I.e., if someone reads an article, do your queries need to reflect that immediately?) If not, and a batch approach is fine (say, nightly processing), then Hadoop is a good first step: it can map/reduce over the logs and aggregate/index the data, either producing static reports like the ones you describe below for all the users & pages, or inserting the results into a database.

If, on the other hand, you need real-time ingest & reporting, that's where HBase would be a better fit. You could write the code so that every "log" you write is actually passed to an API that inserts the data into one or more HBase tables in real time.

The trick is that since HBase doesn't have built-in secondary indexing, you'd have to write the data in ways that can provide low-latency responses to the queries below. That likely means denormalizing the data: in your case, probably one table keyed by page & time ("all users that have been to the webpage X in the last N days"), another keyed by user & time ("all the pages seen by a given user"), and so on. In other words, every action would result in multiple writes into HBase. (This is the same thing that happens in a relational database--an "index" means you're writing the data in two places--but there, it's hidden from you and maintained transparently.) Once you get that working, though, you can have very fast access to real-time data for any of those queries.
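To make that concrete, here's a minimal sketch of what that multi-table write path could look like with the plain Java client API (of the HBase 0.94 era). The table names ("events_by_page", "events_by_user"), the single column family "f", the delimiter-based row keys, and the reverse-timestamp trick are all illustrative assumptions, not a prescribed schema:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class EventWriter {
        private static final byte[] FAMILY = Bytes.toBytes("f");
        private final HTable byPage;  // hypothetical table "events_by_page"
        private final HTable byUser;  // hypothetical table "events_by_user"

        public EventWriter(Configuration conf) throws IOException {
            this.byPage = new HTable(conf, "events_by_page");
            this.byUser = new HTable(conf, "events_by_user");
        }

        /** One logical event fans out into one write per access pattern. */
        public void logEvent(String user, String page, String action, long ts)
                throws IOException {
            // Reverse the timestamp so newer events sort first, and zero-pad
            // it so the string ordering matches the numeric ordering.
            String revTs = String.format("%019d", Long.MAX_VALUE - ts);

            // Keyed by page & time: "all users on page X in the last N days".
            Put p = new Put(Bytes.toBytes(page + "/" + revTs + "/" + user));
            p.add(FAMILY, Bytes.toBytes("action"), Bytes.toBytes(action));
            byPage.put(p);

            // Keyed by user & time: "all pages seen by a given user".
            Put u = new Put(Bytes.toBytes(user + "/" + revTs + "/" + page));
            u.add(FAMILY, Bytes.toBytes("action"), Bytes.toBytes(action));
            byUser.put(u);
        }
    }

The pattern is: every query you want to serve cheaply gets its own table (or key layout), and the write path pays for that up front.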
Note: that's a lot of work. If you don't really need real-time ingest and query, Hadoop queries over the logs are much simpler (especially with tools like Hive, where you can write real SQL statements and have them automatically translated into Hadoop map/reduce jobs).

Also, 10 TB isn't outside the range that a traditional database can handle (given the right hardware, schema design & indexing). You may find it simpler to model your problem that way, either using Hadoop as a bridge between the raw log data and the database (if offline is OK) or inserting directly. The key benefit of going the Hadoop/HBase route is horizontal scalability: even if you don't know your eventual size target, you can be confident that you can scale linearly by adding hardware. That's critical if you're Google or Facebook, but not as frequently required for smaller businesses. Don't over-engineer ... :)

Ian

On Nov 8, 2012, at 3:00 AM, Nick maillard wrote:

Hi everyone

I'm currently testing HBase/Hadoop in terms of performance but also in terms of applicability. After some tries and reads, I'm wondering if HBase is well suited to the need I'm testing.

Say I had logs from websites recording users going to a webpage, reading an article, liking a piece of data, commenting, or bookmarking. I would store these logs over a long period and for a lot of different websites, and I would like to query the data with these questions:

- All users that have been to the webpage X in the last N days.
- All users that have liked and then bookmarked a page in a range of Y days.
- All the pages that are commented X times in the last N days.
- All users that have commented a page W and liked a page P.
- All pages seen, liked or commented by a given user.

As you see, this might be a very SQL way of thinking. Since the questions are different in nature, as I understand it I would have different tables to answer them. Am I correct? How could this be represented, and would SQL be a better fit?

The data would be large, around 10 TB.

regards
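For concreteness, the first query in that list ("all users that have been to the webpage X in the last N days") then becomes a plain row-key range scan over the page-keyed table from the sketch above; the key layout and table name are the same assumptions as before:

    import java.io.IOException;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class UsersOnPage {
        /** Prints users seen on `page` in the last `days` days, newest first. */
        public static void printUsers(String page, int days) throws IOException {
            long now = System.currentTimeMillis();
            long cutoff = now - days * 24L * 60 * 60 * 1000;

            // Rows are "page/reverse-timestamp/user", so newer events carry
            // smaller reverse timestamps and sort ahead of older ones.
            String start = page + "/" + String.format("%019d", Long.MAX_VALUE - now);
            String stop  = page + "/" + String.format("%019d", Long.MAX_VALUE - cutoff);

            HTable byPage = new HTable(HBaseConfiguration.create(), "events_by_page");
            try {
                ResultScanner scanner = byPage.getScanner(
                        new Scan(Bytes.toBytes(start), Bytes.toBytes(stop)));
                for (Result r : scanner) {
                    String[] parts = Bytes.toString(r.getRow()).split("/");
                    System.out.println(parts[parts.length - 1]);  // the user id
                }
                scanner.close();
            } finally {
                byPage.close();
            }
        }
    }

Queries that don't match a precomputed key layout (e.g. "liked and then bookmarked within Y days") would need yet another purpose-built table or an offline map/reduce job--which is exactly the trade-off described above.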