hadoop-common-dev mailing list archives

From "Scott Green" <smallbad...@gmail.com>
Subject Re: HBase Design Ideas, Part II
Date Thu, 30 Nov 2006 04:02:24 GMT

I am not 100% sure that HDFS/HBase is exactly what I want, so let me
state my requirement and ask for help evaluating it. I am still new to Hadoop.

I built a search engine using Nutch, and now I need to collect related
content from other systems. The question is where I should
store the *related content* and how to connect it with the Nutch index.

As for storage, two choices are available: HDFS/HBase and an RDBMS. I'd
prefer the first one. At least initially, the content would only be
appended, with no deleting or updating, and it would be retrieved by
key (something like a foreign key in an RDBMS, connecting the content
back to the Nutch index).
If you think I am on the right track, could you give me some
code snippets showing how to store the content? Big thanks.
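For the append-only, key-addressed pattern described above, a minimal sketch might look like the following. All class and method names here are hypothetical, and the local filesystem stands in for HDFS; with Hadoop proper, `FileSystem.get(conf)` and `fs.create(new Path(...))` would play the roles that `java.nio.file.Files` plays here.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Append-only, key-addressed content store. One file per key; content
// is written once and later retrieved by the same key (e.g. a Nutch
// document id), with no update or delete path.
public class ContentStore {
    private final Path root;

    private ContentStore(Path root) { this.root = root; }

    // Create a store rooted at a fresh temporary directory.
    public static ContentStore inTempDir() {
        try {
            return new ContentStore(Files.createTempDirectory("content-store"));
        } catch (IOException e) { throw new UncheckedIOException(e); }
    }

    // Store content under a key; append-only, so we never rewrite.
    public void put(String key, String content) {
        try {
            Files.write(root.resolve(key), content.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) { throw new UncheckedIOException(e); }
    }

    // Retrieve the content later by the same key.
    public String get(String key) {
        try {
            return new String(Files.readAllBytes(root.resolve(key)), StandardCharsets.UTF_8);
        } catch (IOException e) { throw new UncheckedIOException(e); }
    }
}
```

The point of the sketch is the access pattern, not the backend: as long as reads go through the same key the Nutch index stores, the storage layer can be swapped for HDFS later.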

- Scott.

On 11/30/06, Scott Green <smallbadguy@gmail.com> wrote:
> Hi Mike
> Are you still working on HBase? Could you please give us an update on
> the project status? I have a strong interest in your idea :)
> - Scott
> On 5/16/06, Michael Cafarella <michael.cafarella@gmail.com> wrote:
> > Hi everyone,
> >
> > My previous mail mentioned a bunch of design ideas that were mainly
> > lifted from Jeff Dean's BigTable talk.  BigTable seems like a useful
> > way to do large-scale row storage, and their decisions largely seem
> > like the right ones.
> >
> > However, BigTable still leaves some things on the table.  Items to
> > improve include a query language and multi-row locking, among
> > other things.
> >
> > Dean said explicitly in his talk that they wanted to avoid multirow
> > locking because it's complicated, error-prone, and maybe not necessary.
> > He's right on at least the first two, and maybe the third.
> >
> > Multiple row locks are useful when you're making a change to
> > several rows that should be atomic; you want all the changes
> > or none of the changes.  They're also used in traditional databases
> > if you want to perform an expensive read operation (like a
> > multiway join) and you want to make sure the results don't
> > get modified while you're reading.
> >
> > Distributed lock acquisition is very hard to do.  It's bug-prone
> > and often has very weird performance ramifications.  It's
> > difficult to get working, difficult to tune, difficult to everything.
> >
> > Here are a few ideas on what to do:
> > 1)  Suck it up and have the client acquire locks on multiple
> > HRegionServers simultaneously.  All clients would have to
> > agree to acquire locks according to some global ordering to
> > avoid deadlock.  HRegions would not be allowed to migrate
> > to a new server if locked.
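The global-ordering idea in (1) can be sketched in plain Java. All names below are hypothetical, and real HBase locks would live on the HRegionServers rather than in one process; the sketch only shows why a single agreed ordering rules out deadlock.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.locks.ReentrantLock;

// Deadlock-free multi-row locking: every client sorts its row keys and
// acquires locks in that order. Since all clients use the same global
// ordering, no two clients can each hold a lock the other is waiting on.
public class RowLocks {
    private final Map<String, ReentrantLock> locks = new HashMap<>();

    private synchronized ReentrantLock lockFor(String row) {
        return locks.computeIfAbsent(row, r -> new ReentrantLock());
    }

    // Acquire locks for all requested rows, in the agreed global order.
    public List<ReentrantLock> lockAll(Collection<String> rows) {
        List<String> ordered = new ArrayList<>(rows);
        Collections.sort(ordered);              // the global ordering
        List<ReentrantLock> held = new ArrayList<>();
        for (String row : ordered) {
            ReentrantLock l = lockFor(row);
            l.lock();
            held.add(l);
        }
        return held;
    }

    // Release in reverse order of acquisition.
    public void unlockAll(List<ReentrantLock> held) {
        for (int i = held.size() - 1; i >= 0; i--) held.get(i).unlock();
    }
}
```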
> >
> > If this is a rare circumstance, a better approach would be
> > to have a dedicated "lock acquirer" through which clients
> > make requests.  It doesn't help the theoretical problem here,
> > but it would make debugging an awful lot easier.
> >
> > 2)  In the case of long-lasting read operations, we can
> > use versioning to guarantee consistency.  If each row is
> > annotated with an edit timestamp, and we know that there
> > is sufficient version history available, the long-lasting job
> > can run over a specific version only.
> >
> > Edits can continue to be made to the database while the
> > read-only job is ongoing.  The operation is performed over
> > the database as of the time the task was submitted.
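A single-process sketch of this snapshot-by-timestamp idea (all names hypothetical): each row keeps a history of timestamped edits, the long-running job fixes a snapshot timestamp when it starts, and later writes never change what it reads.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Versioned rows: every edit is stored under its timestamp, and a read
// "as of" time T returns the newest value with timestamp <= T. A
// long-running read picks T once and is unaffected by later edits.
public class VersionedTable {
    private final Map<String, TreeMap<Long, String>> rows = new HashMap<>();

    public synchronized void put(String row, long ts, String value) {
        rows.computeIfAbsent(row, r -> new TreeMap<>()).put(ts, value);
    }

    // Latest value at or before the snapshot timestamp, or null.
    public synchronized String getAsOf(String row, long snapshotTs) {
        TreeMap<Long, String> history = rows.get(row);
        if (history == null) return null;
        Map.Entry<Long, String> e = history.floorEntry(snapshotTs);
        return e == null ? null : e.getValue();
    }
}
```

This works only while "sufficient version history" is retained, exactly as the mail notes: if old versions are garbage-collected before the job finishes, the snapshot is gone.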
> >
> > 3) In the case of multiple row updates, we may be able to
> > use different edit semantics to avoid locking.  For example,
> > consider that we want to add a single column/value pair to
> > multiple rows.  We want this to happen atomically, so that
> > all of the rows get the value or none of them do.
> >
> > If it's just an add, then we don't need to lock the rows at
> > all; the add will always succeed, even if other writes
> > intervene.  Traditionally there's been no distinction among
> > kinds of data "updates", so they all require locking.  If we
> > can get a client to adjust the update semantics slightly,
> > then the locking can be much more relaxed.
> >
> > I'd say that "add" or "append" semantics are likely to be
> > at least as common as "edit" semantics.
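A sketch of the lock-free "add" idea (names hypothetical). It shows why adds commute with concurrent writes and so need no row locks; it does not give crash atomicity across rows, which is a separate problem.

```java
import java.util.Collection;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// "Add" semantics: inserting a brand-new column/value pair into a row
// can't conflict with any other add, so each per-row add is an
// independent, always-succeeding operation -- no locks taken.
public class AddOnlyTable {
    private final ConcurrentMap<String, ConcurrentMap<String, String>> rows =
        new ConcurrentHashMap<>();

    // Add column=value to every listed row; order and interleaving with
    // other adds are irrelevant because adds commute.
    public void addToRows(Collection<String> rowKeys, String column, String value) {
        for (String row : rowKeys) {
            rows.computeIfAbsent(row, r -> new ConcurrentHashMap<>())
                .putIfAbsent(column, value);   // an add never overwrites
        }
    }

    public String get(String row, String column) {
        ConcurrentMap<String, String> r = rows.get(row);
        return r == null ? null : r.get(column);
    }
}
```

The contrast with "edit" semantics is the `putIfAbsent`: because an add never overwrites an existing value, two concurrent clients can never observe each other's partial state as a conflict.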
> >
> > Can you think of the family of edit semantics you'd like
> > to see offered here?
> >
> > Also, how useful do you think a general-purpose query language
> > would be for HBase?  It would be fairly straightforward to implement,
> > for example, a poor man's version of SQL that has different locking
> > and update behavior (and which chucks out the more exotic elements).
> > This might be compiled into a piece of code that is executed
> > immediately, or it might be transformed into a long-lasting mapreduce
> > job.
> >
> > I have a few ideas for such a language, but I'm worried it's getting
> > a little far afield from what we're interested in for Hadoop.
> >
> > --Mike
> >
> >
