From "Geoff Hendrey" <ghend...@decarta.com>
Subject RE: Get operation in HBase Map-Reduce methods
Date Tue, 20 Apr 2010 17:16:51 GMT
As I understand it, you have a table, and you need to do some set of
operations on a subset of the rowIDs in the table. One idea (and I'm new
to this too) would be to create a temporary table and write the rows
in question into it. Then you can use a TableMapper on the temp table.

Another way would be to create a SequenceFile with the RowIDs, then use
a SequenceFileInputFormat to drive your mapreduce job, and do Get
operations to read the rows from within your mapper method.

A third idea would be to override getSplits in TableInputFormat in such
a way that you scan the table and create splits only where you have
blocks of contiguous rows that you need to process. This idea probably
only makes sense if the rows you need to process are not randomly
distributed, but rather occur in ranges.
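
To make the third idea concrete, here is a minimal sketch of the grouping
step only, written in plain Java with no HBase dependency. RowRanges, Range
and contiguousRanges are hypothetical names invented for this illustration;
the scanned keys are assumed to arrive in table order, as they would from a
scan. A real getSplits override would still have to turn each range into an
input split, and since HBase scans treat the stop row as exclusive, the
inclusive "last" key below would need adjusting.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of HBase: groups the wanted row IDs into
// runs of rows that sit next to each other in the table.
final class RowRanges {

    /** One run of adjacent wanted rows; both ends are inclusive. */
    static final class Range {
        final String first;
        final String last;

        Range(String first, String last) {
            this.first = first;
            this.last = last;
        }

        @Override
        public String toString() {
            return "[" + first + ", " + last + "]";
        }
    }

    /**
     * scannedKeys: every row key of the table, in scan order (assumption).
     * wanted: the row IDs that actually need processing.
     */
    static List<Range> contiguousRanges(List<String> scannedKeys, Set<String> wanted) {
        List<Range> ranges = new ArrayList<>();
        String first = null;
        String last = null;
        for (String key : scannedKeys) {
            if (wanted.contains(key)) {
                if (first == null) {
                    first = key;      // a new run starts here
                }
                last = key;           // extend the current run
            } else if (first != null) {
                ranges.add(new Range(first, last)); // an unwanted row ends the run
                first = null;
            }
        }
        if (first != null) {
            ranges.add(new Range(first, last));     // close a run that reaches the end
        }
        return ranges;
    }
}
```

If the wanted rows are scattered, this produces roughly one range per row,
which is the case where the approach stops paying off.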


Going back to the OP's question... using get() within a M/R, the answer
is yes.

However, you have a problem in that you need to somehow determine
which row_id you want to retrieve.

Since you're starting with a list of row_ids, that list should be the
source for your m/r.
So you'd have to work out your mapper to take the data from this list as
its source, and then within each mapper's setup() you connect to HBase
so that the connection can be used in each iteration of map().
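
The shape of that mapper can be sketched as follows. This is only an
illustration under stated assumptions, not working HBase code: RowStore is a
hypothetical stand-in for the table handle so that the sketch runs without
the HBase client, and RowIdMapper merely mirrors the setup()/map() lifecycle
of a Hadoop mapper. In a real job the class would extend Hadoop's Mapper,
the input records would be the row IDs read from the list, and the lookup
would be an HBase Get against the table opened in setup().

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for an HBase table handle: one point lookup per row ID.
interface RowStore {
    String get(String rowId); // returns null when the row does not exist
}

// Mirrors the mapper lifecycle: setup() once per task, map() once per input record.
final class RowIdMapper {
    private RowStore table;
    private final List<String> output = new ArrayList<>();

    // Open the connection once and keep it for every map() call.
    void setup(RowStore openedTable) {
        this.table = openedTable;
    }

    // Each input record is a row ID taken from the list, not a scanned row.
    void map(String rowId) {
        String value = table.get(rowId);
        if (value == null) {
            return; // row ID in the list but not in the table: skip it
        }
        // Placeholder for "some operations with its values": emit the value length.
        output.add(rowId + "\t" + value.length());
    }

    List<String> results() {
        return output;
    }
}
```

The point of the design is that the input format only has to deliver row
IDs; the table is touched once per ID rather than once per stored row, which
avoids matching a long list of filters against every row of a scan.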

I have a process where I scan one column family in a table, and based on
information in the record, I have to perform a get(), so what you want
to do is possible in a M/R.

I don't have a good code example for your specific use case. The issue
isn't in connecting to hbase or doing the get. (That's trivial.) The hard
part is writing a mapper that takes a list in memory as its input.

Now here's the point where someone from Cloudera, Yahoo! or somewhere
else says that even that piece is trivial and here's how to do it. :-)


> On Tue, Apr 20, 2010 at 8:39 AM, Andrey <atimerbaev@gmx.net> wrote:
> > Dear All,
> >
> > Assumed, I've got a list of rowIDs of a HBase table. I want to get
> > each row by its rowID, do some operations with its values, and store
> > the results somewhere subsequently. Is there a good way to do this
> > in a Map-Reduce manner?
> >
> > As far as I understand, a mapper usually takes a Scan to form
> > inputs. It is quite possible to create such a Scan, which contains a
> > lot of RowFilters to be EQUAL to a particular <rowId>. Such a
> > strategy will work for sure, however is inefficient, since each
> > filter will be tried to match to each found row.
> >
> > So, is there a good Map-Reduce praxis for such kind of situations? 
> > (E.g. to make a Get operation inside a map() method.) If yes, could 
> > you kindly point to a good code example?
> >
> > Thank you in advance.
> >
> >
