hbase-user mailing list archives

From Bradford Stephens <bradfordsteph...@gmail.com>
Subject Re: HBase and Web-Scale BI
Date Fri, 27 Feb 2009 00:05:36 GMT
Sure, here we go! I'm not at all opposed to indexing tables, etc. I just
want this thing to be fast and non-kludgy.

Basically, we're collecting social media content (such as blogs), normalizing
the data into fields, and then doing BI on that.

Our data is pretty simple ... here's an example:

Document:

BodyText (string)
BodyText Keywords (Lucene indexed)
URL (indexed, key with collection time?)
ParentDocumentID
Post Date (datetime)
Author Name (indexed, string)
Post Topic (string)
BodyLinks (list of URLs, possibly indexed?)


An example query our user would build in the web interface might be, "What
are the top 15 keywords for all documents from Feb 1st - April 10th where
the author is one of these five people?" We would need to aggregate this
data and present it in no more than 10 seconds.

We're expecting dozens of TB of data, perhaps more...
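
To make the example query concrete, here is a minimal sketch of how it could map onto a key-sorted table. Everything in it is an assumption for illustration: the row-key layout "<author>/<yyyy-MM-dd>/<docId>", the class and method names, and the use of an in-memory TreeMap as a stand-in for a table sorted by row key. It does not use the HBase client API and says nothing about whether 10 seconds is reachable at dozens of TB.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

final class TopKeywords {
    // table: hypothetical row key "<author>/<yyyy-MM-dd>/<docId>" -> keywords of that document.
    // Assumes author names and docIds contain no '/' and dates are zero-padded ISO dates.
    static List<String> topKeywords(NavigableMap<String, List<String>> table,
                                    Collection<String> authors,
                                    String fromDate, String toDateInclusive, int n) {
        Map<String, Integer> counts = new HashMap<>();
        for (String author : authors) {
            // One contiguous range scan per author, because keys sort by author, then date.
            String start = author + "/" + fromDate + "/";
            // '0' is the character right after '/', so this exclusive bound still
            // covers every docId stored under the last day of the range.
            String stop = author + "/" + toDateInclusive + "0";
            for (List<String> keywords : table.subMap(start, true, stop, false).values()) {
                for (String keyword : keywords) {
                    counts.merge(keyword, 1, Integer::sum);
                }
            }
        }
        // Order by descending count, breaking ties alphabetically.
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> top = new ArrayList<>();
        for (int i = 0; i < entries.size() && i < n; i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }

    public static void main(String[] args) {
        NavigableMap<String, List<String>> table = new TreeMap<>();
        table.put("alice/2009-02-01/d1", Arrays.asList("hbase", "hadoop"));
        table.put("alice/2009-04-10/d2", Arrays.asList("hbase", "lucene"));
        table.put("alice/2009-04-11/d3", Arrays.asList("nutch"));   // after the range
        table.put("bob/2009-03-01/d4", Arrays.asList("hbase", "hadoop"));
        table.put("carol/2009-03-01/d5", Arrays.asList("pig"));     // author not requested
        System.out.println(topKeywords(table, Arrays.asList("alice", "bob"),
                                       "2009-02-01", "2009-04-10", 2));
    }
}
```

With this layout the query touches one key range per author instead of the whole table, but the keyword counting still happens on the client. Querying by keyword first would need a different key order or a separate index table.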


On Thu, Feb 26, 2009 at 1:52 PM, Ryan Rawson <ryanobjc@gmail.com> wrote:

> I may have misspoken somewhat - hbase is actually quite good at random
> reads.  But the catch is, it can only randomly read via the row id.  It's
> more or less akin to having a DB table with only an indexed primary key,
> and no secondary indexes.
>
> So, yes, random reads and "index scans" work, and work well.  You just have
> to handle the index creation and maintenance yourself.
>
> -ryan
>
> On Thu, Feb 26, 2009 at 12:06 PM, Jonathan Gray <jlist@streamy.com> wrote:
>
> > Bradford,
> >
> > Many of us probably have some input but it's really difficult to help
> > without having more detail.
> >
> > Can you be more specific about the layout of the data and the queries
> > you'd want to run?
> >
> > HBase is efficient at scanning (as with hdfs), but also efficient at
> > randomly accessing by row key.  If you need to fetch based on column
> > names or values, then hbase will not be efficient without some form of
> > secondary indexing (additional tables in hbase or something external
> > like lucene).
> >
> > JG
> >
> > > -----Original Message-----
> > > From: Bradford Stephens [mailto:bradfordstephens@gmail.com]
> > > Sent: Thursday, February 26, 2009 10:37 AM
> > > To: hbase-user@hadoop.apache.org
> > > Subject: Re: HBase and Web-Scale BI
> > >
> > > Yes, it seems that the fundamental 'differentness' of HDFS/MapReduce
> > > is that they're not very well suited to random access -- I was hoping
> > > HBase had found a way 'around' that, but of course that
> > > 'differentness' is a fundamental strength of the HDFS way of doing
> > > things.
> > >
> > > Where things have gotten murky is that our data is very simple -- we
> > > just have a lot of it. And we don't need to do a *lot* of random
> > > access to our data -- it really doesn't feel like an RDBMS situation.
> > >
> > > Perhaps if we made an index out of a hash of each of our data values,
> > > and did some 'normalization', that could be the key. Or maybe the
> > > metadata is not going to be as large as I thought... hrm.
> > >
> > > I appreciate the input, and hope more people will chime in :)
> > >
> > > On Wed, Feb 25, 2009 at 10:18 PM, Ryan Rawson <ryanobjc@gmail.com>
> > > wrote:
> > >
> > > > Hey,
> > > >
> > > > You have to be clear about what hbase does and does not do.  HBase
> > > > is just not a relational database - its "weakness" is its strength.
> > > >
> > > > In general, you can only access rows in key order.  Keys are stored
> > > > lexicographically sorted however.  There aren't declarative
> > > > secondary indexes (minus the lucene thing, but that isn't an index).
> > > > You have to put all these pieces together to build a system.
> > > >
> > > > But, you get scalability, and reasonable performance, and in 0.20
> > > > you will get really good performance (fast enough to serve websites
> > > > hopefully).
> > > >
> > > > In general you need to make sure your row-key sorts data in the
> > > > order you want to query by.  You can do something like this:
> > > >
> > > > <user> <Long.MAX_VALUE - System.currentTimeMillis()> <event id>
> > > >
> > > > to store events in reverse chronological order by users.
> > > >
> > > > If you want another access method, you need to use a map-reduce job
> > > > and build a secondary index.
> > > >
> > > > I don't know if this exactly answers your question, but hopefully it
> > > > should give you more of an idea of what hbase does and does not do.
> > > >
> > > > -ryan
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > On Wed, Feb 25, 2009 at 9:02 PM, Bradford Stephens <
> > > > bradfordstephens@gmail.com> wrote:
> > > >
> > > > > Greetings,
> > > > >
> > > > > I'm in charge of the data analysis and collection platform at my
> > > > > company, and we're basing a large part of our core analysis
> > > > > platform on Hadoop, Nutch, and Lucene -- it's a delight to use.
> > > > > However, we're going to be wanting some on-demand "web-scale"
> > > > > business intelligence, and I'm wondering if HBase is the right
> > > > > solution -- my research hasn't given me any conclusions.
> > > > >
> > > > > Our data set is pretty simple -- a bunch of XML documents which
> > > > > have been parsed from HTML pages, and some associated data (Author
> > > > > Name, Post Date, Influence, etc). What we would like to be able to
> > > > > do is have our end users do real-time (< 10 seconds) OLAP-type
> > > > > analysis on this, and have it presented on a webpage. For example,
> > > > > queries like ("All authors for the past two weeks who have used
> > > > > these keywords in the post bodies and what their influence score
> > > > > is"). I imagine we'll have several terabytes of data to go
> > > > > through, and we won't be able to do much pre-population of results.
> > > > >
> > > > > Is HBase low-latency enough that we can scale-out to solve these
> > > > > sorts of problems?
> > > > >
> > > > > Cheers,
> > > > > Bradford
> > > > >
> > > >
> >
> >
>
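
For reference, the row-key layout quoted above (<user> <Long.MAX_VALUE - System.currentTimeMillis()> <event id>) could be built roughly as follows. This is only a sketch under stated assumptions: the class name, the "/" separator, and the zero-padded string encoding are illustrative choices that are not from the thread, and plain strings stand in for the byte-array keys a real table would use. The padding matters because keys are compared lexicographically, so the reversed timestamp needs a fixed width for string order to match numeric order.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class ReverseTimeKey {
    // Builds "<user>/<Long.MAX_VALUE - timestamp>/<eventId>".
    // Long.MAX_VALUE has 19 decimal digits, so padding to 19 digits keeps
    // lexicographic order identical to numeric order of the reversed timestamp.
    // Assumes the user name contains no '/' and the timestamp is non-negative.
    static String rowKey(String user, long timestampMillis, String eventId) {
        long reversed = Long.MAX_VALUE - timestampMillis;
        return String.format("%s/%019d/%s", user, reversed, eventId);
    }

    public static void main(String[] args) {
        List<String> keys = new ArrayList<>();
        keys.add(rowKey("alice", 1_000L, "e1"));
        keys.add(rowKey("alice", 3_000L, "e3"));
        keys.add(rowKey("alice", 2_000L, "e2"));
        // Sorting mimics the order a key-sorted table would return on a scan:
        // the most recent event (e3) comes first, the oldest (e1) last.
        Collections.sort(keys);
        for (String key : keys) {
            System.out.println(key);
        }
    }
}
```

Because every key for one user shares the same prefix, a scan that starts at "<user>/" returns that user's newest events first and can stop after the first few rows.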
