hbase-user mailing list archives

From Andrew Purtell <apurt...@apache.org>
Subject Re: HBase and Web-Scale BI
Date Fri, 27 Feb 2009 16:28:51 GMT
I have done something like this for a different domain but with
similar scale and user demands. The analysts by necessity needed
to specify their queries in advance, and we periodically ran
mapreduce jobs to materialize new results into a cache (another
HBase table) as fresh data arrived. Serving answers out of the
cache was then of course very fast. Because we were precomputing
answers, the analysts needed to apply some forethought and
discipline, query capacity had to be rationed, and workflows had
to tolerate slightly out-of-date information. Despite all of
these necessary "drawbacks" the system was quite successful.

Without precomputing the answer to such queries I don't see how
one can present an assembly of such information sourced from TB
(or PB) of data in less than 10 seconds. The essential strategy
here is shifting computation in time and trading cheap disk for
the probably impossible CPU and index I/O demands of would-be
real-time queries.
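
A minimal sketch of that precompute-then-serve pattern, applied to
the "top keywords by author and date range" query from the quoted
thread. Everything here is an illustrative assumption: the class
and method names are hypothetical, a sorted in-memory map stands in
for the cache table, and no HBase or MapReduce API is used.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** In-memory stand-in for a precomputed "answers" table; not HBase API. */
final class PrecomputedKeywordCache {

    /** One parsed document, reduced to the fields this query needs. */
    static final class Doc {
        final String author;
        final LocalDate postDate;
        final List<String> keywords;

        Doc(String author, LocalDate postDate, List<String> keywords) {
            this.author = author;
            this.postDate = postDate;
            this.keywords = keywords;
        }
    }

    // Key: author + '\0' + ISO date. The sorted map mimics a table ordered by row key.
    private final TreeMap<String, Map<String, Long>> cache = new TreeMap<>();

    private static String key(String author, LocalDate day) {
        return author + '\0' + day; // ISO-8601 dates sort lexicographically
    }

    /** Batch step: fold newly arrived documents into per-author, per-day counts. */
    void materialize(List<Doc> newDocs) {
        for (Doc d : newDocs) {
            Map<String, Long> counts =
                    cache.computeIfAbsent(key(d.author, d.postDate), k -> new HashMap<>());
            for (String kw : d.keywords) {
                counts.merge(kw, 1L, Long::sum);
            }
        }
    }

    /** Query step: merge only the precomputed buckets, never the raw documents. */
    List<String> topKeywords(List<String> authors, LocalDate from, LocalDate to, int k) {
        Map<String, Long> total = new HashMap<>();
        for (String author : authors) {
            // Range scan over one author's days, both ends inclusive.
            for (Map<String, Long> bucket
                    : cache.subMap(key(author, from), true, key(author, to), true).values()) {
                bucket.forEach((kw, n) -> total.merge(kw, n, Long::sum));
            }
        }
        List<Map.Entry<String, Long>> entries = new ArrayList<>(total.entrySet());
        // Highest count first; ties broken alphabetically so the result is deterministic.
        entries.sort((a, b) -> {
            int byCount = Long.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(k, entries.size()); i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }

    public static void main(String[] args) {
        PrecomputedKeywordCache cache = new PrecomputedKeywordCache();
        cache.materialize(List.of(
                new Doc("alice", LocalDate.of(2009, 2, 3), List.of("hbase", "hadoop")),
                new Doc("bob", LocalDate.of(2009, 3, 1), List.of("hbase", "lucene"))));
        System.out.println(cache.topKeywords(
                List.of("alice", "bob"), LocalDate.of(2009, 2, 1), LocalDate.of(2009, 4, 10), 15));
    }
}
```

The query cost then depends on the number of (author, day) buckets
touched rather than on the volume of raw documents, which is the
trade of disk for query-time CPU described above. The price is that
the set of answerable queries is fixed by what the batch step chose
to materialize.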

Maybe someone else can speak up if they think I am being too
pessimistic here.

Hope this helps,

   - Andy

> From: Bradford Stephens <bradfordstephens@gmail.com>
> Subject: Re: HBase and Web-Scale BI
> To: hbase-user@hadoop.apache.org
> Date: Thursday, February 26, 2009, 4:05 PM
> Sure, here we go! I'm not at all opposed to indexing tables, etc. I just
> want this thing to be fast and non-kludgy.
> 
> Basically, we're getting social media (like blogs), normalizing the data
> into fields, and then doing BI on that.
> 
> Our data is pretty simple ... here's an example:
> 
> Document:
> 
> BodyText (string)
> BodyText Keywords (Lucene indexed)
> URL (indexed, key with collection time?)
> ParentDocumentID
> Post Date (datetime)
> Author Name (indexed, string)
> Post Topic (string)
> BodyLinks (list of URLs, possibly indexed?)
> 
> 
> An example query our user would build in the web interface might be, "What
> are the top 15 keywords for all documents from Feb 1st - April 10th where
> the author is one of these five people?" We would need to aggregate this
> data and have it presented in no more than 10 seconds.
> 
> We're expecting dozens of TB of data, perhaps more...
> 
> 
> On Thu, Feb 26, 2009 at 1:52 PM, Ryan Rawson
> <ryanobjc@gmail.com> wrote:
> 
> > I may have misspoken somewhat - hbase is actually quite good at random
> > reads.  But the catch is, it can only randomly read via the row id.  It's
> > more or less akin to having a DB table with only an indexed primary key,
> > and no secondary indexes.
> >
> > So, yes, random reads and "index scans" work, and work well.  You just have
> > to handle the index creation and maintenance yourself.
> >
> > -ryan
> >
> > On Thu, Feb 26, 2009 at 12:06 PM, Jonathan Gray <jlist@streamy.com> wrote:
> >
> > > Bradford,
> > >
> > > Many of us probably have some input but it's really difficult to help
> > > without having more detail.
> > >
> > > Can you be more specific about the layout of the data and the queries
> > > you'd want to run?
> > >
> > > HBase is efficient at scanning (as with hdfs), but also efficient at
> > > randomly accessing by row key.  If you need to fetch based on column
> > > names or values, then hbase will not be efficient without some form of
> > > secondary indexing (additional tables in hbase or something external
> > > like lucene).
> > >
> > > JG
> > >
> > > > -----Original Message-----
> > > > From: Bradford Stephens [mailto:bradfordstephens@gmail.com]
> > > > Sent: Thursday, February 26, 2009 10:37 AM
> > > > To: hbase-user@hadoop.apache.org
> > > > Subject: Re: HBase and Web-Scale BI
> > > >
> > > > Yes, it seems that the fundamental 'differentness' of HDFS/MapReduce is
> > > > that they're not very well suited to random access -- I was hoping HBase
> > > > had found a way 'around' that, but of course that 'differentness' is a
> > > > fundamental strength of the HDFS way of doing things.
> > > >
> > > > Where things have gotten murky is that our data is very simple -- we
> > > > just have a lot of it. And we don't need to do a *lot* of random access
> > > > to our data -- it really doesn't feel like an RDBMS situation.
> > > >
> > > > Perhaps if we made an index out of a hash of each of our data values,
> > > > and did some 'normalization', that could be the key. Or maybe the
> > > > metadata is not going to be as large as I thought... hrm.
> > > >
> > > > I appreciate the input, and hope more people will chime in :)
> > > >
> > > > On Wed, Feb 25, 2009 at 10:18 PM, Ryan Rawson <ryanobjc@gmail.com> wrote:
> > > >
> > > > > Hey,
> > > > >
> > > > > You have to be clear about what hbase does and does not do.  HBase is
> > > > > just not a relational database - its "weakness" is its strength.
> > > > >
> > > > > In general, you can only access rows in key order.  Keys are stored
> > > > > lexicographically sorted, however.  There aren't declarative secondary
> > > > > indexes (minus the lucene thing, but that isn't an index).  You have
> > > > > to put all these pieces together to build a system.
> > > > >
> > > > > But, you get scalability, and reasonable performance, and in 0.20 you
> > > > > will get really good performance (fast enough to serve websites
> > > > > hopefully).
> > > > >
> > > > > In general you need to make sure your row-key sorts data in the order
> > > > > you want to query by.  You can do something like this:
> > > > >
> > > > > <user> <Long.MAX_VALUE - System.currentTimeMillis()> <event id>
> > > > >
> > > > > to store each user's events in reverse chronological order.
> > > > >
> > > > > If you want another access method, you need to use a map-reduce job to
> > > > > build a secondary index.
> > > > >
> > > > > I don't know if this exactly answers your question, but hopefully it
> > > > > should give you more of an idea of what hbase does and does not do.
> > > > >
> > > > > -ryan
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On Wed, Feb 25, 2009 at 9:02 PM, Bradford Stephens
> > > > > <bradfordstephens@gmail.com> wrote:
> > > > >
> > > > > > Greetings,
> > > > > >
> > > > > > I'm in charge of the data analysis and collection platform at my
> > > > > > company, and we're basing a large part of our core analysis platform
> > > > > > on Hadoop, Nutch, and Lucene -- it's a delight to use. However, we're
> > > > > > going to be wanting some on-demand "web-scale" business intelligence,
> > > > > > and I'm wondering if HBase is the right solution -- my research
> > > > > > hasn't given me any conclusions.
> > > > > >
> > > > > > Our data set is pretty simple -- a bunch of XML documents which have
> > > > > > been parsed from HTML pages, and some associated data (Author Name,
> > > > > > Post Date, Influence, etc). What we would like to be able to do is
> > > > > > have our end users do real-time (< 10 seconds) OLAP-type analysis on
> > > > > > this, and have it presented on a webpage. For example, queries like
> > > > > > ("All authors for the past two weeks who have used these keywords in
> > > > > > the post bodies and what their influence score is"). I imagine we'll
> > > > > > have several terabytes of data to go through, and we won't be able to
> > > > > > do much pre-population of results.
> > > > > >
> > > > > > Is HBase low-latency enough that we can scale-out to solve these
> > > > > > sorts of problems?
> > > > > >
> > > > > > Cheers,
> > > > > > Bradford
> > > > > >
> > > > >
> > >
> > >
> >

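The "handle the index creation and maintenance yourself" point from
the quoted replies can be illustrated in the same spirit. This is
only a sketch: two sorted in-memory maps stand in for a primary
table and an index table, the names are hypothetical, and the author
field is assumed to contain no '\0' character. In a real deployment
the two writes would go to separate tables and would need their own
consistency handling.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** In-memory illustration of a hand-maintained secondary index; not HBase API. */
final class AuthorIndexSketch {

    // "Primary table": row key -> author, sorted by row key.
    private final TreeMap<String, String> documents = new TreeMap<>();

    // "Index table": author + '\0' + row key -> row key, sorted by author first.
    private final TreeMap<String, String> authorIndex = new TreeMap<>();

    /** Writes the document and keeps the index entry in step with it. */
    void put(String rowKey, String author) {
        String previousAuthor = documents.put(rowKey, author);
        if (previousAuthor != null) {
            // Drop the stale entry so the index never points at an old author.
            authorIndex.remove(previousAuthor + '\0' + rowKey);
        }
        authorIndex.put(author + '\0' + rowKey, rowKey);
    }

    /** Prefix scan over the index: every key in [author\0, author\1). */
    List<String> rowKeysByAuthor(String author) {
        return new ArrayList<>(
                authorIndex.subMap(author + '\0', author + '\u0001').values());
    }

    public static void main(String[] args) {
        AuthorIndexSketch table = new AuthorIndexSketch();
        table.put("doc1", "alice");
        table.put("doc2", "bob");
        table.put("doc3", "alice");
        System.out.println(table.rowKeysByAuthor("alice"));
    }
}
```

The lookup by author becomes a short range scan over sorted keys
instead of a full pass over the primary data, at the cost of an
extra write (and an extra cleanup step) on every update.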

      
