Date: Mon, 1 Mar 2010 22:38:14 -0800 (PST)
From: Andrew Purtell
Subject: Re: Handling Interactive versus Batch Calculations
To: hbase-user@hadoop.apache.org

> I think Jonathan Gray began working on something similar to this a few
> months ago for Streamy.

Regrettably that was proprietary and remains so to the best of my
knowledge.

> As JD said, Coprocessors are very interesting, and I think they're
> worth looking at (or contributing a patch for!)

Amen to that. I've been working on this part time but my attention is
split three ways wrt. HBase at the moment.

A simple server side in-process MapReduce is implemented; see the patch
on HBASE-2001. What is currently missing is client side support to
dispatch such a MapReduce job on a table to all of the region servers
and to collect/aggregate the results. Also, the server side
implementation holds all intermediate values in the heap. What we have
now is a sketch that needs some work: it really should spill
intermediates to local disk (as HFiles) as necessary and then read/merge
them back in.
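To make the shape of that idea concrete, here is a minimal, self-contained
sketch of an in-process map/shuffle/reduce over a region's rows, with the
intermediates held in the heap as described above. All names here
(`mapAndShuffle`, `reduce`, the prefix-as-group-key convention) are
illustrative only, not the actual HBASE-2001 patch API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of a "server side in-process MapReduce":
// map over a region's rows, shuffle intermediates in the heap, reduce.
public class InProcessMapReduce {

    // Map + in-heap shuffle: group each row's value under a derived key
    // (here, the row key prefix before ':'). The intermediates live
    // entirely in the heap -- exactly the limitation noted above; a real
    // implementation would spill them to disk as HFiles when large.
    public static Map<String, List<Long>> mapAndShuffle(Map<String, Long> rows) {
        Map<String, List<Long>> intermediates = new TreeMap<>();
        for (Map.Entry<String, Long> row : rows.entrySet()) {
            String groupKey = row.getKey().split(":")[0];
            intermediates.computeIfAbsent(groupKey, k -> new ArrayList<>())
                         .add(row.getValue());
        }
        return intermediates;
    }

    // Reduce: collapse each group of intermediate values to a single sum.
    public static Map<String, Long> reduce(Map<String, List<Long>> intermediates) {
        Map<String, Long> results = new TreeMap<>();
        intermediates.forEach((key, values) -> results.put(
                key, values.stream().mapToLong(Long::longValue).sum()));
        return results;
    }
}
```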
We need something like the LRU block cache, but for globally managing
the heap used by MapReduce intermediate values so they don't blow out
the region server heap. I also need to integrate filters with
coprocessors, and to work on the code weaving aspect: weaving in CPU and
memory policy limits as the coprocessor code is loaded on demand from
jars on HDFS. This work will eventually get done, but
patches/contributions are most welcome!

   - Andy

----- Original Message ----
> From: Bradford Stephens
> To: hbase-user@hadoop.apache.org
> Sent: Tue, March 2, 2010 12:36:36 PM
> Subject: Re: Handling Interactive versus Batch Calculations
>
> Hey Nenshad --
>
> I think Jonathan Gray began working on something similar to this a few
> months ago for Streamy.
>
> As JD said, Coprocessors are very interesting, and I think they're
> worth looking at (or contributing a patch for!) if you basically need
> to use HBase as a "Giant Spreadsheet", such as:
> (Row, Column) -> Value -> Result. Building the functionality is a
> considerable task, so I don't think you'll see it in a release from
> the main contributors soon. I could be wrong.
>
> If you need to do a real-time query/calculation on a certain subset of
> data, that's where our platform may help. Such as "Sum of all
> transactions where UserName=Jimmy and ZipCode=98104".
>
> I'd be happy to talk more about Coprocessors if you want more details :)
>
> Cheers,
> Bradford
>
> On Sun, Feb 28, 2010 at 11:56 AM, Nenshad Bardoliwalla wrote:
> > Hello All,
> >
> > This is my first message to the list, so please feel free to refer me
> > to other posts, blogs, etc. to get me up to speed. I understand that
> > HBase and MapReduce work side by side, that is, that they can feed
> > each other data.
> > I have two sets of use cases for my application: one which requires
> > batch style calculations in parallel, which MapReduce is perfect for,
> > and one which requires interactive calculations, which I'm not sure
> > how to accomplish in HBase. By interactive calculation, I mean that a
> > user makes a request to HBase which requires some transformation of
> > the data in HDFS (say an aggregation or an allocation) and wants the
> > results returned immediately. Here are my questions:
> >
> > 1. What is the mechanism by which you can build your own calculations
> > that return results quickly in HBase? Is it just Java classes, or
> > some other technique?
> > 2. For these types of calculations, does HBase handle acquiring the
> > data if it's distributed across multiple boxes like MapReduce does,
> > or do I have to write my own algorithms that seek out the data on the
> > right nodes?
> > 3. Is it possible to break up the work across multiple nodes and then
> > bring it together like a MapReduce, but without the performance
> > penalty of using the MapReduce framework? In other words, if HBase
> > knows that files A-D are on node 1 and E-G are on node 2, can I write
> > a function that says "sum up X on node 1 locally and Y on node 2
> > locally" and bring it back to me combined?
> > 4. Are there ways to guarantee that the computation will happen
> > in-memory on the local column store, or is that the only place such
> > calculations happen?
> >
> > Apologies for what must be very basic questions. Any pointers really
> > appreciated. Thank you.
> >
> > Best Regards,
> >
> > Nenshad
> >
> > --
> > Nenshad D. Bardoliwalla
> > Twitter: http://twitter.com/nenshad
> > Book: http://www.driventoperform.net
> > Blog: http://bardoli.blogspot.com
>
> --
> http://www.drawntoscalehq.com -- The intuitive, cloud-scale data
> solution. Process, store, query, search, and serve all your data.
>
> http://www.roadtofailure.com -- The Fringes of Scalability, Social
> Media, and Computer Science
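The scatter/gather shape of question 3 above (sum locally on each node,
combine at the client) can be sketched without any HBase machinery.
In this sketch, threads stand in for region servers, and every name
(`localSum`, `sum`) is hypothetical rather than an existing API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: each "node" sums its own partition locally and
// the client combines the partial results, like MapReduce's scatter and
// gather but without the framework overhead.
public class ScatterGatherSum {

    // The local step: sum one partition, as a node holding those files would.
    static long localSum(List<Long> partition) {
        return partition.stream().mapToLong(Long::longValue).sum();
    }

    // Scatter the local sums across a pool, then gather and combine.
    public static long sum(List<List<Long>> partitions)
            throws InterruptedException, ExecutionException {
        ExecutorService pool =
                Executors.newFixedThreadPool(Math.max(1, partitions.size()));
        try {
            List<Future<Long>> partials = new ArrayList<>();
            for (List<Long> partition : partitions) {
                partials.add(pool.submit(() -> localSum(partition)));
            }
            long total = 0;
            for (Future<Long> partial : partials) {
                total += partial.get(); // gather: combine at the "client"
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }
}
```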