hbase-user mailing list archives

From Otis Gospodnetic <otis_gospodne...@yahoo.com>
Subject Re: Jumping to row and scan forward?
Date Sat, 06 Mar 2010 02:37:54 GMT
Hi J-D,

----- Original Message ----
> From: Jean-Daniel Cryans <jdcryans@apache.org>
> To: hbase-user@hadoop.apache.org
> Sent: Fri, March 5, 2010 5:38:03 PM
> Subject: Re: Jumping to row and scan forward?
> Otis,
> What you're basically saying is: is there a way to sequentially scan
> random row keys?

Hmmmm.... no.  I'm wondering if there is a way to first *jump* to a row with a given key and
then scan to the end from there.
For example, imagine keys:

And imagine that some job went through these rows.  It got to the last row, row with key 666.
 This key 666 got stored somewhere as "this is the last key we saw".
After that happens, some more rows get added, so now we have this:
666  <=== last seen

Then, 15 minutes later, the job starts again and wants to process only the new data.  That
is, only rows after row with key 666.
So how can we do that efficiently?
Can we say "jump to key=666 and then scan from there forward"?
Or do we have to start from the very beginning of the table every time, looking for row with
key 666, ignoring all rows until we find this row 666 and processing only rows after 666.
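As a hedged illustration of the "jump to a key, then scan forward" idea: HBase keeps rows sorted by row key, and a Scan can be given a start row so that it seeks to that key instead of reading from the top of the table. The sketch below does not use the HBase client. It models the sorted table with a java.util.TreeMap, and the class name, method name, and every key other than 666 are made up for the example. It also shows the catch raised earlier in the thread: seeking returns rows whose keys sort after 666, which is not the same as rows inserted after 666 when keys are random identifiers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Stand-in for an HBase table: rows are kept sorted by key.
// This is a model for illustration, not the HBase client API.
class ScanFromKey {

    // Returns the keys that sort strictly after lastSeen, in key order.
    // tailMap seeks to the position of lastSeen; earlier keys are not visited.
    static List<String> keysAfter(NavigableMap<String, String> table, String lastSeen) {
        return new ArrayList<>(table.tailMap(lastSeen, false).keySet());
    }

    public static void main(String[] args) {
        NavigableMap<String, String> table = new TreeMap<>();
        table.put("111", "old row");
        table.put("666", "last seen row");
        // Two rows added after the previous run (hypothetical keys):
        table.put("700", "new row, sorts after 666");
        table.put("250", "new row, sorts before 666");

        // Only "700" is found; "250" is new but is skipped by the seek.
        System.out.println(keysAfter(table, "666")); // prints [700]
    }
}
```

So a start-row scan avoids re-reading old rows only if new keys always sort after the last seen key.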

My "worry" is that we have to start from the beginning every time and filter many-many-many
rows, so I'm wondering if jumping directly to a specific key and then doing a scan from there is
possible.

> I can't think of an awesome answer... sequential insert could make
> sense depending on how much data you have to write per day, there's
> stuff that can be optimized to make it work better. Also you could
> write the data to 2 tables and only process the second one... which
> you clear afterwards (maybe actually keep 2 tables just for that since
> while you process one you want to write to the other).

Yeah, I was thinking something with multiple tables (one big/archive one and another small
one for new data) might work, but if we can jump to a specific key and then scan, that is
even better.


> J-D
> On Fri, Mar 5, 2010 at 1:50 PM, Otis Gospodnetic wrote:
> > Hi,
> >
> > I need to process (with a MR job) data stored in HBase.  The data is added to
> > HBase incrementally (and stored in there forever) and so I'd like this MR job to
> > process only the new data every time it runs.  The row keys are not timestamps
> > (because we know what this does to performance of bulk puts), but rather random
> > identifiers.  To process only the new data each time the MR job runs, the
> > *timestamp* (stored in one of the columns in each row) is stored elsewhere as
> > "timestamp of the last processed/seen row" and the MR job uses a server-side
> > filter to zip through all previously processed by filtering (skipping) rows
> > where ts < stored ts.
> >
> > Jean-Daniel Cryans suggested this 2-3 months ago here:
> > http://search-hadoop.com/m?id=31a243e70912242347k55ffc527w344c9fe2842fe363@mail.gmail.com
> >
> > I say "zip", but this still means going through millions and millions and
> > hundreds of millions of rows.
> >
> > Is there *anything* in HBase that would allow one to skip/jump to (or near!)
> > the "last processed/seen row" and scan from there on, instead of always having
> > to scan from the very beginning?
> >
> > Thanks,
> > Otis
> > ----
> > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> > Hadoop ecosystem search :: http://search-hadoop.com/
> >
> >
