hbase-user mailing list archives

From Otis Gospodnetic <otis_gospodne...@yahoo.com>
Subject Jumping to row and scan forward?
Date Fri, 05 Mar 2010 21:50:13 GMT
Hi,

I need to process (with a MR job) data stored in HBase.  The data is added to HBase incrementally
(and stored in there forever) and so I'd like this MR job to process only the new data every
time it runs.  The row keys are not timestamps (because we know what this does to performance
of bulk puts), but rather random identifiers.  Each row carries a *timestamp* in one of its
columns.  To process only the new data each time the MR job runs, the timestamp of the last
processed/seen row is stored elsewhere, and the MR job uses a server-side filter to zip
through all previously processed rows by skipping rows where ts < stored ts.
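
To make the cost concrete, here is a minimal in-memory sketch of what the filter approach
amounts to.  It is not HBase client code: a TreeMap stands in for the table (sorted by row key,
as HBase rows are), and the class and method names are made up for illustration.  In a real job
the skip test would presumably be a server-side filter such as SingleColumnValueFilter on the
timestamp column.  Because the keys are random, row order says nothing about the timestamp, so
every row has to be examined.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for the MR job's scan; nothing here is part of the HBase API.
class IncrementalScanSketch {

    /** Outcome of one pass over the table. */
    static final class ScanResult {
        final List<String> newRowKeys = new ArrayList<>();
        long rowsExamined = 0;   // rows the filter had to look at
        long maxTimestampSeen;   // value to store for the next run

        ScanResult(long startTs) {
            this.maxTimestampSeen = startTs;
        }
    }

    /**
     * Walks the whole table in row-key order and keeps rows that are not older
     * than lastProcessedTs. Rows with ts == lastProcessedTs pass again, because
     * the filter described above only skips ts < stored ts.
     */
    static ScanResult scanNewRows(TreeMap<String, Long> table, long lastProcessedTs) {
        ScanResult result = new ScanResult(lastProcessedTs);
        for (Map.Entry<String, Long> row : table.entrySet()) {
            result.rowsExamined++;
            long ts = row.getValue();
            if (ts < lastProcessedTs) {
                continue; // previously processed row: skipped, but still visited
            }
            result.newRowKeys.add(row.getKey());
            result.maxTimestampSeen = Math.max(result.maxTimestampSeen, ts);
        }
        return result;
    }

    public static void main(String[] args) {
        // Random-looking row keys mapped to the timestamp column of each row.
        TreeMap<String, Long> table = new TreeMap<>();
        table.put("k9", 100L);
        table.put("a3", 300L);
        table.put("m1", 200L);

        ScanResult r = scanNewRows(table, 200L);
        System.out.println("examined=" + r.rowsExamined
                + " new=" + r.newRowKeys
                + " nextStoredTs=" + r.maxTimestampSeen);
    }
}
```

The rowsExamined counter is the part I care about: it always equals the table size, however
few rows are actually new.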


Jean-Daniel Cryans suggested this 2-3 months ago here:
http://search-hadoop.com/m?id=31a243e70912242347k55ffc527w344c9fe2842fe363@mail.gmail.com

I say "zip", but this still means going through millions and millions and hundreds of millions
of rows.

Is there *anything* in HBase that would allow one to skip/jump to (or near!) the "last processed/seen
row" and scan from there on, instead of always having to scan from the very beginning?

Thanks,
Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Hadoop ecosystem search :: http://search-hadoop.com/

