Date: Mon, 1 Mar 2010 22:38:14 -0800 (PST)
From: Andrew Purtell
Subject: Re: Handling Interactive versus Batch Calculations
To: hbase-user@hadoop.apache.org

> I think Jonathan Gray began working on something similar to this a few
> months ago for Streamy.

Regrettably that was proprietary and remains so to the best of my
knowledge.

> As JD said, Coprocessors are very interesting, and I think they're
> worth looking at (or contributing a patch for!)

Amen to that. I've been working on this part time but my attention is
split three ways wrt. HBase at the moment.

A simple server side in-process MapReduce is implemented; see the patch
on HBASE-2001. What is currently missing is client side support to
dispatch such a MapReduce job on a table to all of the region servers
and to collect/aggregate the results. Also, the server side
implementation holds all intermediate values in the heap. What we have
now is a sketch that needs some work: it really should spill
intermediates to local disk (as HFiles) as necessary and then read/merge
them back in.
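To make the shape of that idea concrete, here is a minimal, self-contained
sketch of an in-process map/shuffle/reduce over a region's rows, with the
intermediates held in the heap as described above. All names here
(`mapAndShuffle`, `reduce`, the prefix-as-group-key convention) are
illustrative only, not the actual HBASE-2001 patch API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of a "server side in-process MapReduce":
// map over a region's rows, shuffle intermediates in the heap, reduce.
public class InProcessMapReduce {

    // Map + in-heap shuffle: group each row's value under a derived key
    // (here, the row key prefix before ':'). The intermediates live
    // entirely in the heap -- exactly the limitation noted above; a real
    // implementation would spill them to disk as HFiles when large.
    public static Map<String, List<Long>> mapAndShuffle(Map<String, Long> rows) {
        Map<String, List<Long>> intermediates = new TreeMap<>();
        for (Map.Entry<String, Long> row : rows.entrySet()) {
            String groupKey = row.getKey().split(":")[0];
            intermediates.computeIfAbsent(groupKey, k -> new ArrayList<>())
                         .add(row.getValue());
        }
        return intermediates;
    }

    // Reduce: collapse each group of intermediate values to a single sum.
    public static Map<String, Long> reduce(Map<String, List<Long>> intermediates) {
        Map<String, Long> results = new TreeMap<>();
        intermediates.forEach((key, values) -> results.put(
                key, values.stream().mapToLong(Long::longValue).sum()));
        return results;
    }
}
```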
We need something like the LRU block cache, but for globally managing
the heap used by MapReduce intermediate values so they don't blow out
the region server heap. I also need to integrate filters with
coprocessors, and to work on the code weaving aspect: weaving in CPU and
memory policy limits as the coprocessor code is loaded on demand from
jars on HDFS. This work will eventually get done, but
patches/contributions are most welcome!

   - Andy

----- Original Message ----
> From: Bradford Stephens
> To: hbase-user@hadoop.apache.org
> Sent: Tue, March 2, 2010 12:36:36 PM
> Subject: Re: Handling Interactive versus Batch Calculations
>
> Hey Nenshad --
>
> I think Jonathan Gray began working on something similar to this a few
> months ago for Streamy.
>
> As JD said, Coprocessors are very interesting, and I think they're
> worth looking at (or contributing a patch for!) if you basically need
> to use HBase as a "Giant Spreadsheet", such as:
> (Row, Column) -> Value -> Result. Building the functionality is a
> considerable task, so I don't think you'll see it in a release from
> the main contributors soon. I could be wrong.
>
> If you need to do a real-time query/calculation on a certain subset of
> data, that's where our platform may help. Such as "Sum of all
> transactions where UserName=Jimmy and ZipCode=98104".
>
> I'd be happy to talk more about Coprocessors if you want more details :)
>
> Cheers,
> Bradford
>
> On Sun, Feb 28, 2010 at 11:56 AM, Nenshad Bardoliwalla wrote:
> > Hello All,
> >
> > This is my first message to the list, so please feel free to refer me
> > to other posts, blogs, etc. to get me up to speed. I understand that
> > HBase and MapReduce work side by side, that is, that they can feed
> > each other data.
> > I have two sets of use cases for my application: one which requires
> > batch style calculations in parallel, which MapReduce is perfect for,
> > and one which requires interactive calculations, which I'm not sure
> > how to accomplish in HBase. By interactive calculation, I mean that a
> > user makes a request to HBase which requires some transformation of
> > the data in HDFS (say an aggregation or an allocation) and wants the
> > results returned immediately. Here are my questions:
> >
> > 1. What is the mechanism by which you can build your own calculations
> > that return results quickly in HBase? Is it just Java classes, or
> > some other technique?
> > 2. For these types of calculations, does HBase handle acquiring the
> > data if it's distributed across multiple boxes like MapReduce does,
> > or do I have to write my own algorithms that seek out the data on the
> > right nodes?
> > 3. Is it possible to break up the work across multiple nodes and then
> > bring it together like a MapReduce, but without the performance
> > penalty of using the MapReduce framework? In other words, if HBase
> > knows that files A-D are on node 1 and E-G are on node 2, can I write
> > a function that says "sum up X on node 1 locally and Y on node 2
> > locally" and bring it back to me combined?
> > 4. Are there ways to guarantee that the computation will happen
> > in-memory on the local column store, or is that the only place such
> > calculations happen?
> >
> > Apologies for what must be very basic questions. Any pointers really
> > appreciated. Thank you.
> >
> > Best Regards,
> >
> > Nenshad
> >
> > --
> > Nenshad D. Bardoliwalla
> > Twitter: http://twitter.com/nenshad
> > Book: http://www.driventoperform.net
> > Blog: http://bardoli.blogspot.com
>
> --
> http://www.drawntoscalehq.com -- The intuitive, cloud-scale data
> solution. Process, store, query, search, and serve all your data.
>
> http://www.roadtofailure.com -- The Fringes of Scalability, Social
> Media, and Computer Science
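The scatter/gather shape of question 3 above (sum locally on each node,
combine at the client) can be sketched without any HBase machinery.
In this sketch, threads stand in for region servers, and every name
(`localSum`, `sum`) is hypothetical rather than an existing API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: each "node" sums its own partition locally and
// the client combines the partial results, like MapReduce's scatter and
// gather but without the framework overhead.
public class ScatterGatherSum {

    // The local step: sum one partition, as a node holding those files would.
    static long localSum(List<Long> partition) {
        return partition.stream().mapToLong(Long::longValue).sum();
    }

    // Scatter the local sums across a pool, then gather and combine.
    public static long sum(List<List<Long>> partitions)
            throws InterruptedException, ExecutionException {
        ExecutorService pool =
                Executors.newFixedThreadPool(Math.max(1, partitions.size()));
        try {
            List<Future<Long>> partials = new ArrayList<>();
            for (List<Long> partition : partitions) {
                partials.add(pool.submit(() -> localSum(partition)));
            }
            long total = 0;
            for (Future<Long> partial : partials) {
                total += partial.get(); // gather: combine at the "client"
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }
}
```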