hbase-user mailing list archives

From Ian Varley <ivar...@salesforce.com>
Subject Re: Nosqls schema design
Date Thu, 08 Nov 2012 13:46:19 GMT
Hi Nick,

The key question to ask about this use case is the access pattern. Do you need real-time access
to new information as it is created? (I.e. if someone reads an article, do your queries need
to immediately reflect that?) If not, and a batch approach is fine (say, nightly processing)
then Hadoop is a good first step; it can map/reduce over the logs and aggregate / index the
data, either creating static reports like you say below for all the users & pages, or
inserting the data into a database.
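For the batch route, the nightly job can be a simple map/reduce-style rollup over the raw log lines. A minimal Python sketch of the aggregation step (the tab-separated log format -- user, action, page, timestamp -- is an assumption for illustration, not your actual schema):

```python
from collections import defaultdict

def aggregate_page_views(log_lines):
    """Nightly rollup: count distinct users per page.

    Assumes tab-separated lines of: user, action, page, timestamp.
    (The log format here is hypothetical; adapt to your real logs.)
    """
    users_by_page = defaultdict(set)
    for line in log_lines:
        user, action, page, ts = line.rstrip("\n").split("\t")
        if action == "view":                     # map: filter + key by page
            users_by_page[page].add(user)
    # reduce: emit (page, distinct-user count)
    return {page: len(users) for page, users in users_by_page.items()}

logs = [
    "alice\tview\t/articles/1\t2012-11-07T10:00",
    "bob\tview\t/articles/1\t2012-11-07T11:00",
    "alice\tview\t/articles/1\t2012-11-07T12:00",
    "bob\tlike\t/articles/2\t2012-11-07T13:00",
]
print(aggregate_page_views(logs))  # {'/articles/1': 2}
```

In real Hadoop, the filter/key step is the mapper and the distinct-count is the reducer; the logic is the same.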

If, on the other hand, you need real-time ingest & reporting, that's where HBase would
be a better fit. You could write the code so that every "log" you write is actually passed
to an API that inserts the data into one or more HBase tables in real-time.

The trick is that since HBase doesn't have built-in secondary indexing, you'd have to write
the data in ways that provide low-latency responses to the queries below. That likely
means denormalizing the data; in your case, probably one table that's keyed by page &
time ("all users that have been to the webpage X in the last N days"), another that's keyed
by user & time ("all the pages seen by a given user"), etc. So in other words, every action
would result in multiple writes into HBase. (This is the same thing that happens in a relational
database--an "index" means you're writing the data in two places--but there, it's hidden from
you and maintained transparently). But, once you get that working, you can have very fast
access to real time data for any of those queries.
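To make the double-write pattern concrete, here is a minimal Python sketch using plain dicts to stand in for the two denormalized tables (real code would issue Puts through the HBase client; the key layouts and the `record_view` / `users_for_page` helpers are illustrative assumptions):

```python
# Dicts stand in for two denormalized HBase tables; a real system
# would write to HBase via the client API instead.
page_time_table = {}   # row key: page + reversed timestamp
user_time_table = {}   # row key: user + timestamp

MAX_TS = 10**13  # any constant larger than every epoch-millis timestamp

def record_view(user, page, ts_millis):
    """Write one log event to both tables (denormalization: two writes).

    Reversing the timestamp in the page-table key makes the newest
    events sort first, so "page X in the last N days" becomes a short
    scan from the row-key prefix.
    """
    reversed_ts = MAX_TS - ts_millis
    page_time_table[f"{page}|{reversed_ts:013d}"] = user
    user_time_table[f"{user}|{ts_millis:013d}"] = page

def users_for_page(page, since_millis):
    """All users that visited `page` at or after `since_millis`.

    sorted() emulates HBase's byte-ordered scan between a start row
    (the page prefix) and a stop row (the reversed cutoff timestamp).
    """
    prefix = f"{page}|"
    stop = f"{page}|{MAX_TS - since_millis:013d}"
    return {u for k, u in sorted(page_time_table.items())
            if k.startswith(prefix) and k <= stop}

record_view("alice", "pageX", 1_000)
record_view("bob", "pageX", 2_000)
record_view("carol", "pageY", 3_000)
print(users_for_page("pageX", since_millis=1_500))  # {'bob'}
```

The per-user table answers "all pages seen by user U" the same way, keyed by user instead of page; each additional query shape generally means one more table and one more write per event.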

Note: that's a lot of work. If you don't really need real-time ingest and query, Hadoop queries
over the logs are much simpler (especially with tools like Hive, where you can write real
SQL statements and it automatically translates them into Hadoop map/reduce jobs).
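For example, the first query below is a one-liner in Hive-style SQL. This sketch runs the equivalent statement against in-memory SQLite purely to show the shape of the query (in Hive you would declare an external table over the raw logs; the table and column names here are assumptions):

```python
import sqlite3

# SQLite stands in for Hive just to show the shape of the SQL;
# Hive would translate the same SELECT into map/reduce jobs.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE logs (user TEXT, action TEXT, page TEXT, ts TEXT)")
con.executemany("INSERT INTO logs VALUES (?, ?, ?, ?)", [
    ("alice", "view", "pageX", "2012-11-07"),
    ("bob",   "view", "pageX", "2012-10-01"),
    ("alice", "like", "pageY", "2012-11-06"),
])

# "All users that have been to webpage X in the last N days"
rows = con.execute(
    """SELECT DISTINCT user FROM logs
       WHERE action = 'view' AND page = 'pageX' AND ts >= '2012-11-01'"""
).fetchall()
print([r[0] for r in rows])  # ['alice']
```

No pre-built tables per query shape are needed on this path; the tradeoff is batch latency instead of real-time answers.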

Also, 10TB isn't outside of the range that a traditional database could handle (given the
right HW, schema design & indexing). You may find it is simpler to model your problem
that way, either using Hadoop as a bridge between the raw log data and the database (if offline
is OK) or inserting directly. The key benefit of going the Hadoop/HBase route is horizontal
scalability, meaning that even if you don't know your eventual size target, you can be confident
that you can scale linearly by adding hardware. That's critical if you're Google or Facebook,
but not as frequently required for smaller businesses. Don't over-engineer ... :)


On Nov 8, 2012, at 3:00 AM, Nick maillard wrote:

Hi everyone

I'm currently testing HBase/Hadoop in terms of performance but also in terms of
applicability. After some experiments and reading, I'm wondering if HBase is well
suited to the use case I'm testing.

Say I had logs from websites recording users visiting a webpage, reading an article,
liking a piece of content, commenting, or even bookmarking.
I would store these logs over a long period, for many different websites,
and I would like to query the data with these questions:
- All users that have been to the webpage X in the last N days.
- All users that have liked and then bookmarked a page within a range of Y days.
- All the pages that were commented on X times in the last N days.
- All users that have commented on a page W and liked a page P.
- All pages seen, liked, or commented on by a given user.

As you can see, this may be a very SQL way of thinking. Since the questions differ
in nature, my understanding is that I would need different tables to answer them.
Am I correct? How could this be represented, and would SQL be a better fit?
The data would be large, around 10 TB.

