hbase-user mailing list archives

From Michael Segel <michael_se...@hotmail.com>
Subject RE: Using external indexes in an HBase Map/Reduce job...
Date Tue, 12 Oct 2010 16:20:35 GMT


It's a bit more complicated than that.

What I can say is that I have a billion rows of data. 
I want to pull a specific 100K rows from the table. 

The row keys are not contiguous and you could say they are 'random' such that if I were to
do a table scan, I'd have to scan the entire table (All regions).

Now if I had a list of the 100K rows, then from a single client I could just create 100 threads
and grab rows from HBase one at a time in each thread.
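
The single-client approach described above could be sketched roughly as follows. This is only a sketch under assumptions: `fetchRow` is a hypothetical stand-in for whatever performs a one-row HBase lookup (it is not an HBase API), and the class and method names are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

class ParallelRowFetch {
    /**
     * Looks up every key on a fixed-size thread pool and returns the
     * results in the same order as rowKeys.
     * fetchRow is a HYPOTHETICAL placeholder for a single-row lookup.
     */
    static <R> List<R> fetchAll(List<String> rowKeys, int threads,
                                Function<String, R> fetchRow) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submit one task per row key; the pool caps concurrency.
            List<Future<R>> pending = new ArrayList<>(rowKeys.size());
            for (String key : rowKeys) {
                Callable<R> task = () -> fetchRow.apply(key);
                pending.add(pool.submit(task));
            }
            // Collect results in submission order.
            List<R> results = new ArrayList<>(rowKeys.size());
            for (Future<R> f : pending) {
                results.add(f.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while fetching rows", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("row lookup failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

All lookups here run inside one client JVM, which is the part that does not translate directly to a map/reduce job.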

But in a m/r, I can't really do that.  (I want to do processing on the data I get returned.)

So given a List Object with the row keys, how do I do a map reduce with this list as the starting

Sure, I could write it to HDFS and then do a m/r reading from the file and setting my own splits
to control parallelism. 
But I'm hoping for a more elegant solution.
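
The fallback described above (write the keys out, then control the splits) amounts to chunking the key list so that each map task receives a fixed number of keys. A minimal sketch of just that chunking step, with hypothetical class and method names and no Hadoop dependency:

```java
import java.util.ArrayList;
import java.util.List;

class KeyListSplitter {
    /**
     * Breaks rowKeys into consecutive chunks of at most keysPerSplit keys.
     * Each chunk stands for the input of one map task; the last chunk
     * may be shorter than the others.
     */
    static List<List<String>> split(List<String> rowKeys, int keysPerSplit) {
        if (keysPerSplit <= 0) {
            throw new IllegalArgumentException("keysPerSplit must be positive");
        }
        List<List<String>> splits = new ArrayList<>();
        for (int start = 0; start < rowKeys.size(); start += keysPerSplit) {
            int end = Math.min(start + keysPerSplit, rowKeys.size());
            // Copy the sublist so each split is independent of the source list.
            splits.add(new ArrayList<>(rowKeys.subList(start, end)));
        }
        return splits;
    }
}
```

The number of keys per split is what controls parallelism: 100K keys at 1,000 keys per split yields 100 map tasks. For keys stored one per line in a text file, Hadoop's NLineInputFormat gives a similar lines-per-split behavior, though whether it suits this job is not something the thread establishes.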

I know that it's possible, but I haven't thought it out... Was hoping someone else had this


> From: buttler1@llnl.gov
> To: user@hbase.apache.org
> Date: Tue, 12 Oct 2010 08:35:25 -0700
> Subject: RE: Using external indexes in an HBase Map/Reduce job...
> Sorry, I am not clear on exactly what you are trying to accomplish here.  I have a table
> roughly of that size, and it doesn't seem to cause me any trouble.  I also have a few separate
> solr indexes for data in the table for query -- the solr query syntax is sufficient for my
> current needs.  This setup allows me to do two things efficiently:
> 1) batch processing of all records (e.g. tagging records that match particular criteria)
> 2) search/lookup from a UI in an online manner
> 3) it is also fairly easy to insert a bunch of records (keeping track of their keys),
> and then run various batch processes only over those new records -- essentially doing what
> you suggest: create a file of keys and split the map task over that file.
> Dave
> -----Original Message-----
> From: Michael Segel [mailto:michael_segel@hotmail.com] 
> Sent: Tuesday, October 12, 2010 5:36 AM
> To: hbase-user@hadoop.apache.org
> Subject: Using external indexes in an HBase Map/Reduce job...
> Hi,
> Now I realize that most everyone is sitting in NY, while some of us can't leave our respective
> Came across this problem and I was wondering how others solved it.
> Suppose you have a really large table with 1 billion rows of data. 
> Since HBase really doesn't have any indexes built in (Don't get me started about the
> contrib/transactional stuff...), you're forced to use some sort of external index, or roll
> your own index table.
> The net result is that you end up with a list object that contains your result set.
> So the question is... what's the best way to feed the list object in?
> One option I thought about is writing the object to a file, using that file as the job
> input, and then controlling the splits. Not the most efficient but it would work.
> Was trying to find a more 'elegant' solution and I'm sure that anyone using SOLR or LUCENE
> or whatever... has come across this problem too.
> Any suggestions? 
> Thx