Subject: Re: Jumping to row and scan forward?
From: Ryan Rawson <ryanobjc@gmail.com>
To: hbase-user@hadoop.apache.org
Date: Fri, 5 Mar 2010 18:45:24 -0800

I think you might want to use Scan.setTimeRange, which can be used to
only get 'new' things.

-ryan
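A minimal sketch of what that looks like against the 0.20-era client
API; the table name and where the last-run timestamp comes from are
hypothetical here:

  import java.io.IOException;

  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.client.ResultScanner;
  import org.apache.hadoop.hbase.client.Scan;

  public class NewRowsScan {
    public static void main(String[] args) throws IOException {
      HTable table = new HTable(new HBaseConfiguration(), "mytable"); // hypothetical table

      // Timestamp recorded at the end of the previous job run (stored elsewhere).
      long lastRun = Long.parseLong(args[0]);

      Scan scan = new Scan();
      // The range is [min, max): only cells written at or after lastRun are returned.
      scan.setTimeRange(lastRun, Long.MAX_VALUE);

      ResultScanner scanner = table.getScanner(scan);
      try {
        for (Result r : scanner) {
          // process only rows that have at least one cell in the time range
        }
      } finally {
        scanner.close();
      }
    }
  }

Because the time range is checked region-side, rows with no cells in
the range are never shipped back to the client at all.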
On Fri, Mar 5, 2010 at 6:37 PM, Otis Gospodnetic wrote:
> Hi J-D,
>
>
> ----- Original Message ----
>> From: Jean-Daniel Cryans
>> To: hbase-user@hadoop.apache.org
>> Sent: Fri, March 5, 2010 5:38:03 PM
>> Subject: Re: Jumping to row and scan forward?
>>
>> Otis,
>>
>> What you're basically saying is: is there a way to sequentially scan
>> random row keys?
>
>
> Hmmmm.... no.  I'm wondering if there is a way to first *jump* to a row
> with a given key and then scan to the end from there.
> For example, imagine keys:
> ...
> 777
> 444
> 222
> 666
>
> And imagine that some job went through these rows.  It got to the last
> row, row with key 666.  This key 666 got stored somewhere as "this is
> the last key we saw".
> After that happens, some more rows get added, so now we have this:
> ...
> 777
> 444
> 222
> 666  <=== last seen
> 333
> 999
> 888
>
> Then, 15 minutes later, the job starts again and wants to process only
> the new data.  That is, only rows after row with key 666.
> So how can we do that efficiently?
> Can we say "jump to key=666 and then scan from there forward"?
> Or do we have to start from the very beginning of the table every time,
> looking for row with key 666, ignoring all rows until we find this row
> 666 and processing only rows after 666.
>
> My "worry" is that we have to start from the beginning every time and
> filter many-many-many rows,
> so I'm wondering if jumping directly to a specific key and then doing a
> scan from there is possible.
>
>
>> I can't think of an awesome answer... sequential insert could make
>> sense depending on how much data you have to write per day, there's
>> stuff that can be optimized to make it work better. Also you could
>> write the data to 2 tables and only process the second one... which
>> you clear afterwards (maybe actually keep 2 tables just for that since
>> while you process one you want to write to the other).
>
>
> Yeah, I was thinking something with multiple tables (one big/archive one
> and another small one for new data) might work, but if we can jump to a
> specific key and then scan, that is even better.
>
> Thanks,
> Otis
>
>> J-D
>>
>> On Fri, Mar 5, 2010 at 1:50 PM, Otis Gospodnetic
>> wrote:
>> > Hi,
>> >
>> > I need to process (with a MR job) data stored in HBase.  The data is
>> > added to HBase incrementally (and stored in there forever) and so I'd
>> > like this MR job to process only the new data every time it runs.
>> > The row keys are not timestamps (because we know what this does to
>> > performance of bulk puts), but rather random identifiers.  To process
>> > only the new data each time the MR job runs, the *timestamp* (stored
>> > in one of the columns in each row) is stored elsewhere as "timestamp
>> > of the last processed/seen row" and the MR job uses a server-side
>> > filter to zip through all previously processed rows by filtering
>> > (skipping) rows where ts < stored ts.
>> >
>> > Jean-Daniel Cryans suggested this 2-3 months ago here:
>> > http://search-hadoop.com/m?id=31a243e70912242347k55ffc527w344c9fe2842fe363@mail.gmail.com
>> >
>> > I say "zip", but this still means going through millions and millions
>> > and hundreds of millions of rows.
>> >
>> > Is there *anything* in HBase that would allow one to skip/jump to (or
>> > near!) the "last processed/seen row" and scan from there on, instead
>> > of always having to scan from the very beginning?
>> >
>> > Thanks,
>> > Otis
>> > ----
>> > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
>> > Hadoop ecosystem search :: http://search-hadoop.com/
>> >
>> >
>
>
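On the "jump to key=666 and then scan from there forward" question
itself: a scan can be told to start at an arbitrary row key, and the
client goes straight to the region containing that key rather than
reading from the start of the table. A sketch, with a hypothetical
table name:

  import java.io.IOException;

  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.client.ResultScanner;
  import org.apache.hadoop.hbase.client.Scan;
  import org.apache.hadoop.hbase.util.Bytes;

  public class JumpScan {
    public static void main(String[] args) throws IOException {
      HTable table = new HTable(new HBaseConfiguration(), "mytable"); // hypothetical table

      // Start the scan at the last-seen key; the client locates the region
      // holding this key directly, with no pass over earlier rows.
      Scan scan = new Scan(Bytes.toBytes("666"));

      ResultScanner scanner = table.getScanner(scan);
      try {
        for (Result r : scanner) {
          if (Bytes.toString(r.getRow()).equals("666")) {
            continue; // the start row itself is included, so skip the last-seen key
          }
          // rows with keys sorting after "666" arrive here, in key order
        }
      } finally {
        scanner.close();
      }
    }
  }

The caveat for Otis's case is that a scan walks keys in sorted order,
so starting at 666 only yields newer rows if new keys sort after old
ones; with random identifiers as keys, rows added later can sort
anywhere in the table, which is why the startRow jump alone does not
solve the "only new data" problem here.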
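For reference, the server-side "skip rows where ts < stored ts" filter
Otis describes can be written with SingleColumnValueFilter. A sketch,
assuming a hypothetical ts:t column holding the timestamp as
Bytes.toBytes(long); big-endian long bytes compare in numeric order as
long as the values are non-negative, which epoch millis are:

  import java.io.IOException;

  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.client.ResultScanner;
  import org.apache.hadoop.hbase.client.Scan;
  import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
  import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
  import org.apache.hadoop.hbase.util.Bytes;

  public class TsFilterScan {
    public static void main(String[] args) throws IOException {
      HTable table = new HTable(new HBaseConfiguration(), "mytable"); // hypothetical table

      long storedTs = Long.parseLong(args[0]); // "timestamp of the last processed/seen row"

      // Keep only rows whose ts:t column is >= the stored timestamp; the
      // comparison happens region-side, so skipped rows are never shipped back.
      SingleColumnValueFilter filter = new SingleColumnValueFilter(
          Bytes.toBytes("ts"), Bytes.toBytes("t"),
          CompareOp.GREATER_OR_EQUAL, Bytes.toBytes(storedTs));
      filter.setFilterIfMissing(true); // drop rows that lack the timestamp column

      Scan scan = new Scan();
      scan.setFilter(filter);

      ResultScanner scanner = table.getScanner(scan);
      try {
        for (Result r : scanner) {
          // only rows with ts >= storedTs arrive here; the regions still
          // read every row, which is the cost Otis is trying to avoid
        }
      } finally {
        scanner.close();
      }
    }
  }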
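And a sketch of J-D's two-table idea: every write goes to both the big
archive table and a small incoming table, the job scans only the
incoming table, and clearing it afterwards is done the way the shell's
truncate command does (disable, drop, recreate). All table and column
names here are made up:

  import java.io.IOException;

  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.HColumnDescriptor;
  import org.apache.hadoop.hbase.HTableDescriptor;
  import org.apache.hadoop.hbase.client.HBaseAdmin;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.util.Bytes;

  public class TwoTableWriter {
    private final HTable archive;   // everything, kept forever
    private final HTable incoming;  // only rows since the last job run

    public TwoTableWriter(HBaseConfiguration conf) throws IOException {
      archive = new HTable(conf, "data_archive");   // hypothetical names
      incoming = new HTable(conf, "data_incoming");
    }

    // Each write goes to both tables; the MR job scans only 'incoming'.
    public void write(byte[] row, byte[] value) throws IOException {
      Put put = new Put(row);
      put.add(Bytes.toBytes("d"), Bytes.toBytes("v"), value);
      archive.put(put);
      incoming.put(put);
    }

    // After a successful run, clear 'incoming': disable, drop, recreate.
    public static void clearIncoming(HBaseConfiguration conf) throws IOException {
      HBaseAdmin admin = new HBaseAdmin(conf);
      admin.disableTable("data_incoming");
      admin.deleteTable("data_incoming");
      HTableDescriptor desc = new HTableDescriptor("data_incoming");
      desc.addFamily(new HColumnDescriptor("d"));
      admin.createTable(desc);
    }
  }

J-D's refinement of keeping two incoming tables and alternating which
one receives writes avoids losing writes that land while a run is in
progress.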