hbase-user mailing list archives

From "Jonathan Gray" <jl...@streamy.com>
Subject Re: Web BI with HBase : Use Case
Date Tue, 12 May 2009 00:37:16 GMT
Vertica and Greenplum are RDBMS-based solutions that cost lots of money
and solve a bunch of problems out of the box.

HBase is free and doesn't do as much for you.  It takes care of
distribution and fault-tolerance, and provides multi-dimensional sorted
maps to store your data in.  There are no real secondary indexes, joins,
SQL syntax, etc.
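For intuition, that "multi-dimensional sorted map" can be pictured as a map of
row key -> column -> (timestamp -> value), with rows kept in sorted key order.
Here is a toy Python sketch of that abstraction -- an illustration only, not
the real HBase client API (the `ToyTable` class and its method names are
invented for this example):

```python
# Toy model of HBase's storage abstraction: a sorted map of
#   row key -> column -> {timestamp -> value}.
# Illustration only; this is NOT the HBase client API.

class ToyTable:
    def __init__(self):
        self.rows = {}  # row key -> {column -> {timestamp -> value}}

    def put(self, row, column, value, ts):
        self.rows.setdefault(row, {}).setdefault(column, {})[ts] = value

    def get(self, row, column):
        # The newest version wins, mirroring HBase's default read behavior.
        versions = self.rows.get(row, {}).get(column, {})
        return versions[max(versions)] if versions else None

    def scan(self, start_row, stop_row):
        # Rows come back in sorted key order; range scans like this are
        # the primary access pattern when there are no secondary indexes.
        for row in sorted(self.rows):
            if start_row <= row < stop_row:
                yield row, self.rows[row]

t = ToyTable()
t.put("doc#002", "meta:author", "bob", ts=1)
t.put("doc#001", "meta:author", "alice", ts=1)
t.put("doc#001", "meta:author", "alice_updated", ts=2)
print(t.get("doc#001", "meta:author"))            # -> alice_updated
print([r for r, _ in t.scan("doc#000", "doc#999")])  # -> ['doc#001', 'doc#002']
```

The practical consequence: anything you want to look up fast has to be encoded
into the sorted row key, because that is the only index you get for free.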

I would need to hear more about your project and more details about the
example you provided, but it seems that you could utilize HBase (possibly
in conjunction with lucene/katta and mapreduce).  I myself am using Katta
and HBase together right now and have some similar use cases I will be
exploring in the future, though my dataset is a bit smaller.

HBase will push more of the work toward the application layer.  Since
you're already using Katta, it doesn't seem you mind too much about
dealing with things at that level.  HBase alone may not give you
enough indexing, but you can accomplish a significant amount with
denormalization and MapReduce.  With a 10-second response budget, if you
kept the data you needed to deal with in a single row, or within
neighboring rows, you could read through a huge number of columns/values
and do both client-side and server-side (via filters) processing.
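To make the single-row idea concrete, here is a hedged sketch in plain Python
(not HBase code; the row/column layout is an assumption invented for this
example): one denormalized row per keyword, with one column per
(author, document) pair whose value is the document's create date.  A
count-of-documents-per-author over a date range then becomes one wide-row
read plus a filter:

```python
from collections import Counter
from datetime import date

# Hypothetical denormalized row: row key = the search keyword, one column
# per matching document, qualifier "author|doc_id", value = create date.
row = {
    "alice|doc1": date(2009, 1, 10),
    "alice|doc2": date(2009, 3, 2),
    "bob|doc3":   date(2009, 2, 14),
    "bob|doc4":   date(2008, 11, 5),   # falls outside the range queried below
}

def docs_per_author(row, start, end, authors=None):
    # Client-side filtering over one wide row.  In HBase, the date test
    # could also run server-side via a filter to cut network traffic.
    counts = Counter()
    for qualifier, created in row.items():
        author, _, _doc = qualifier.partition("|")
        if authors and author not in authors:
            continue
        if start <= created < end:
            counts[author] += 1
    return counts

print(docs_per_author(row, date(2009, 1, 1), date(2009, 5, 1)))
# -> Counter({'alice': 2, 'bob': 1})
```

When new metadata arrives later (e.g. a PageRank score), it is just another
column written into the same row -- no schema change required.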


On Mon, May 11, 2009 3:56 pm, dotnetmetal wrote:
> Hey there,
> I've been working with Hadoop for about a year now, and have recently
> been tasked with our new metadata storage and analysis platform. I'm
> looking for your advice into what I should research, and if HBase is right
> for our use cases.
> Currently, we're collecting documents onto our Hadoop cluster, and then
> indexing them with Lucene (and Katta). Documents have attributes like a
> create date, author, bodytext, domain, etc.
> We're looking at 20TB of data to start with, growing by a few dozen a
> day.
> I'm researching the best way to provide BI on top of this data that our
> customers can "Slice and Dice" on. HBase has some appealing
> characteristics, but I'm not sure if it's *quite* what we need, since
> latency is an issue. Lucene has great indexing, but we're also going to be
> adding metadata constantly and performing schema changes.
> Here's a use case:
> A customer searches for a keyword in our web UI and a list of a few
> hundred thousand documents is returned. The customer would then like to
> select a few random authors from those documents for a certain date range
> (let's say 4
> months), and get a count of documents per author. A few hours later, these
>  documents are tagged with some more metadata... say, PageRank of the
> parent domain. The user can use this data as part of his queries as well.
> We'd like
> to have a response time of 10 seconds or so.
> I don't care much about storage space, so denormalization is totally
> fine. Is this a problem we can tackle in HBase or another open source
> distributed DB?
> A company called "Vertica" claims to be able to do this, but I wasn't
> very impressed with their architecture. "Greenplum" also looks
> interesting, but I haven't researched them much yet.
> Thanks for all your help!
