Subject: Re: Jumping to row and scan forward?
From: Ryan Rawson <ryanobjc@gmail.com>
To: hbase-user@hadoop.apache.org
Date: Fri, 5 Mar 2010 18:45:24 -0800

I think you might want to use Scan.setTimeRange, which can be used to
only get 'new' things.

-ryan
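A minimal sketch of what that looks like against the 0.20-era client
API; the table name and where the last-run timestamp comes from are
hypothetical here:

  import java.io.IOException;

  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.client.ResultScanner;
  import org.apache.hadoop.hbase.client.Scan;

  public class NewRowsScan {
    public static void main(String[] args) throws IOException {
      HTable table = new HTable(new HBaseConfiguration(), "mytable"); // hypothetical table

      // Timestamp recorded at the end of the previous job run (stored elsewhere).
      long lastRun = Long.parseLong(args[0]);

      Scan scan = new Scan();
      // The range is [min, max): only cells written at or after lastRun are returned.
      scan.setTimeRange(lastRun, Long.MAX_VALUE);

      ResultScanner scanner = table.getScanner(scan);
      try {
        for (Result r : scanner) {
          // process only rows that have at least one cell in the time range
        }
      } finally {
        scanner.close();
      }
    }
  }

Because the time range is checked region-side, rows with no cells in
the range are never shipped back to the client at all.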
On Fri, Mar 5, 2010 at 6:37 PM, Otis Gospodnetic wrote:
> Hi J-D,
>
>
> ----- Original Message ----
>> From: Jean-Daniel Cryans
>> To: hbase-user@hadoop.apache.org
>> Sent: Fri, March 5, 2010 5:38:03 PM
>> Subject: Re: Jumping to row and scan forward?
>>
>> Otis,
>>
>> What you're basically saying is: is there a way to sequentially scan
>> random row keys?
>
>
> Hmmmm.... no.  I'm wondering if there is a way to first *jump* to a row
> with a given key and then scan to the end from there.
> For example, imagine keys:
> ...
> 777
> 444
> 222
> 666
>
> And imagine that some job went through these rows.  It got to the last
> row, row with key 666.  This key 666 got stored somewhere as "this is
> the last key we saw".
> After that happens, some more rows get added, so now we have this:
> ...
> 777
> 444
> 222
> 666  <=== last seen
> 333
> 999
> 888
>
> Then, 15 minutes later, the job starts again and wants to process only
> the new data.  That is, only rows after row with key 666.
> So how can we do that efficiently?
> Can we say "jump to key=666 and then scan from there forward"?
> Or do we have to start from the very beginning of the table every time,
> looking for row with key 666, ignoring all rows until we find this row
> 666 and processing only rows after 666.
>
> My "worry" is that we have to start from the beginning every time and
> filter many-many-many rows,
> so I'm wondering if jumping directly to a specific key and then doing a
> scan from there is possible.
>
>
>> I can't think of an awesome answer... sequential insert could make
>> sense depending on how much data you have to write per day, there's
>> stuff that can be optimized to make it work better. Also you could
>> write the data to 2 tables and only process the second one... which
>> you clear afterwards (maybe actually keep 2 tables just for that since
>> while you process one you want to write to the other).
>
>
> Yeah, I was thinking something with multiple tables (one big/archive one
> and another small one for new data) might work, but if we can jump to a
> specific key and then scan, that is even better.
>
> Thanks,
> Otis
>
>> J-D
>>
>> On Fri, Mar 5, 2010 at 1:50 PM, Otis Gospodnetic
>> wrote:
>> > Hi,
>> >
>> > I need to process (with a MR job) data stored in HBase.  The data is
>> > added to HBase incrementally (and stored in there forever) and so I'd
>> > like this MR job to process only the new data every time it runs.
>> > The row keys are not timestamps (because we know what this does to
>> > performance of bulk puts), but rather random identifiers.  To process
>> > only the new data each time the MR job runs, the *timestamp* (stored
>> > in one of the columns in each row) is stored elsewhere as "timestamp
>> > of the last processed/seen row" and the MR job uses a server-side
>> > filter to zip through all previously processed rows by filtering
>> > (skipping) rows where ts < stored ts.
>> >
>> > Jean-Daniel Cryans suggested this 2-3 months ago here:
>> > http://search-hadoop.com/m?id=31a243e70912242347k55ffc527w344c9fe2842fe363@mail.gmail.com
>> >
>> > I say "zip", but this still means going through millions and millions
>> > and hundreds of millions of rows.
>> >
>> > Is there *anything* in HBase that would allow one to skip/jump to (or
>> > near!) the "last processed/seen row" and scan from there on, instead
>> > of always having to scan from the very beginning?
>> >
>> > Thanks,
>> > Otis
>> > ----
>> > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
>> > Hadoop ecosystem search :: http://search-hadoop.com/
>> >
>> >
>
>
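On the "jump to key=666 and then scan from there forward" question
itself: a scan can be told to start at an arbitrary row key, and the
client goes straight to the region containing that key rather than
reading from the start of the table. A sketch, with a hypothetical
table name:

  import java.io.IOException;

  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.client.ResultScanner;
  import org.apache.hadoop.hbase.client.Scan;
  import org.apache.hadoop.hbase.util.Bytes;

  public class JumpScan {
    public static void main(String[] args) throws IOException {
      HTable table = new HTable(new HBaseConfiguration(), "mytable"); // hypothetical table

      // Start the scan at the last-seen key; the client locates the region
      // holding this key directly, with no pass over earlier rows.
      Scan scan = new Scan(Bytes.toBytes("666"));

      ResultScanner scanner = table.getScanner(scan);
      try {
        for (Result r : scanner) {
          if (Bytes.toString(r.getRow()).equals("666")) {
            continue; // the start row itself is included, so skip the last-seen key
          }
          // rows with keys sorting after "666" arrive here, in key order
        }
      } finally {
        scanner.close();
      }
    }
  }

The caveat for Otis's case is that a scan walks keys in sorted order,
so starting at 666 only yields newer rows if new keys sort after old
ones; with random identifiers as keys, rows added later can sort
anywhere in the table, which is why the startRow jump alone does not
solve the "only new data" problem here.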
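For reference, the server-side "skip rows where ts < stored ts" filter
Otis describes can be written with SingleColumnValueFilter. A sketch,
assuming a hypothetical ts:t column holding the timestamp as
Bytes.toBytes(long); big-endian long bytes compare in numeric order as
long as the values are non-negative, which epoch millis are:

  import java.io.IOException;

  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.client.ResultScanner;
  import org.apache.hadoop.hbase.client.Scan;
  import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
  import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
  import org.apache.hadoop.hbase.util.Bytes;

  public class TsFilterScan {
    public static void main(String[] args) throws IOException {
      HTable table = new HTable(new HBaseConfiguration(), "mytable"); // hypothetical table

      long storedTs = Long.parseLong(args[0]); // "timestamp of the last processed/seen row"

      // Keep only rows whose ts:t column is >= the stored timestamp; the
      // comparison happens region-side, so skipped rows are never shipped back.
      SingleColumnValueFilter filter = new SingleColumnValueFilter(
          Bytes.toBytes("ts"), Bytes.toBytes("t"),
          CompareOp.GREATER_OR_EQUAL, Bytes.toBytes(storedTs));
      filter.setFilterIfMissing(true); // drop rows that lack the timestamp column

      Scan scan = new Scan();
      scan.setFilter(filter);

      ResultScanner scanner = table.getScanner(scan);
      try {
        for (Result r : scanner) {
          // only rows with ts >= storedTs arrive here; the regions still
          // read every row, which is the cost Otis is trying to avoid
        }
      } finally {
        scanner.close();
      }
    }
  }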
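And a sketch of J-D's two-table idea: every write goes to both the big
archive table and a small incoming table, the job scans only the
incoming table, and clearing it afterwards is done the way the shell's
truncate command does (disable, drop, recreate). All table and column
names here are made up:

  import java.io.IOException;

  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.HColumnDescriptor;
  import org.apache.hadoop.hbase.HTableDescriptor;
  import org.apache.hadoop.hbase.client.HBaseAdmin;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.util.Bytes;

  public class TwoTableWriter {
    private final HTable archive;   // everything, kept forever
    private final HTable incoming;  // only rows since the last job run

    public TwoTableWriter(HBaseConfiguration conf) throws IOException {
      archive = new HTable(conf, "data_archive");   // hypothetical names
      incoming = new HTable(conf, "data_incoming");
    }

    // Each write goes to both tables; the MR job scans only 'incoming'.
    public void write(byte[] row, byte[] value) throws IOException {
      Put put = new Put(row);
      put.add(Bytes.toBytes("d"), Bytes.toBytes("v"), value);
      archive.put(put);
      incoming.put(put);
    }

    // After a successful run, clear 'incoming': disable, drop, recreate.
    public static void clearIncoming(HBaseConfiguration conf) throws IOException {
      HBaseAdmin admin = new HBaseAdmin(conf);
      admin.disableTable("data_incoming");
      admin.deleteTable("data_incoming");
      HTableDescriptor desc = new HTableDescriptor("data_incoming");
      desc.addFamily(new HColumnDescriptor("d"));
      admin.createTable(desc);
    }
  }

J-D's refinement of keeping two incoming tables and alternating which
one receives writes avoids losing writes that land while a run is in
progress.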