hbase-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Michael Segel <michael_se...@hotmail.com>
Subject RE: How to efficiently join HBase tables?
Date Tue, 31 May 2011 18:56:22 GMT

Doug, 

I read the OP's post as the following:
"> Hi,
> I need to join two HBase tables. 
The obvious way is to use a M/R job for that. 
The problem is that the few references to that question I found recommend pulling one table
to the mapper 
and then do a lookup for the referred row in the second table.
"

So the underlying problem that the OP was trying to solve was how to join two tables from
HBase.
Unfortunately I goofed.
I gave a quick and dirty solution that is a bit incomplete. They row key in the temp table
has to be unique and I forgot about the Cartesian
product. So my solution wouldn't work in the general case.

If you wanted to do this in the general case...

Map job 1 selects data from the first table and then writes it to tempA where the key is 
the join column(s) in order, followed by all of the rows. (This is a wide key)
Also the map job returns a count of all of the rows in tempA

Map job 2 selects data from the second table again writing the entire return set as the key
so that each row is also unique.

Now take the smaller of two sets and run a map reduce job. 
For each row, you'd want to create a scan of the other table, using a start and stop row based
on the value in the joined columns.
Write out that row and the corresponding row from the scan in to your result set... but here
I think you just need a unique key if you're just
dumping this to file, otherwise you will want to write the entire record as your key and each
column as your value.

This will account for the Cartesian product in the join.

So it will look something like this:
Tab A
K, 1,2,3
   a,b,c
   a,a,b
   b,b,a
   b,c,a
   c,b,a

Tab B
K, 1,2,3
   a,e,f
   a,c,d
   b,z,z

SELECT *
FROM A, B
WHERE A.1 = B.1
AND   A.1 = a


Tab A
K	1,2,3
a|b|c	a,b,c
a|a|b	a,a,b

Tab B 
a|e|f	a,e,f
a|c|d	a,c,d

So output = 
K		1,2,3,4,5,6  
a|b|c|e|f	a,b,c,a,e,f
a|b|c|c|d	a,b,c,a,c,d
a|a|b|e|f	a,a,b,a,e,f
a|a|b|c|d	a,a,b,a,c,d


I'm looking at the http://hbase.apache.org/docs/current/api/
Didn't see a multi-get... not sure it would buy you much. My suggestion takes advantage of
the fact that HBase puts things in sort order for you.

I think the above solution will work in the general sense, however what's the max width of
the row key?


-Mike





> From: doug.meil@explorysmedical.com
> To: user@hbase.apache.org
> Date: Tue, 31 May 2011 11:42:27 -0400
> Subject: RE: How to efficiently join HBase tables?
> 
> Eran's observation was that a join is solvable in a Mapper via lookups on a 2nd HBase
table, but it might not be that efficient if the lookups are 1 by 1.  I agree with that.
> 
> My suggestion was to use multi-Get for the lookups instead.  So you'd hold onto a batch
of records in the Mapper and then the batch size is filled, then you do the lookups (and then
any required emitting, etc.).  
> 
> 
> 
> -----Original Message-----
> From: Michael Segel [mailto:michael_segel@hotmail.com] 
> Sent: Tuesday, May 31, 2011 10:56 AM
> To: user@hbase.apache.org
> Subject: RE: How to efficiently join HBase tables?
> 
> 
> Maybe I'm missing something... but this isn't a hard problem to solve.
> 
> Eran wants to join two tables.
> If we look at an SQL Statement...
> 
> SELECT A.*, B.*
> FROM A, B
> WHERE A.1 = B.1
> AND  A.2 = B.2
> AND  A.3 = xxx
> AND A.4 = yyy
> AND B.45 = zzz
> 
> Or something along those lines.
> 
> So what you're essentially doing is saying I want to take a subset of data from table
A, and a subset of data from table B and join them on the values in columns 1 and 2.
> Table A's data will be filtered on columns 3 and 4 and B's data will be filtered on column
45. NOTE: since you don't know the relationship of the column names to either table, you're
safer in writing tableA|column_name and tableB|column_name to your temp table.
> 
> So if you create a temp table FOO where the key is column 1 and column 2 (column1|column2)
then when you walk through the subsets adding them to the temp table, you will get the end
result automatically.
> 
> Then you can output your hbase temp table and then truncate the table.
> 
> So what am I missing?
> 
> -Mike
> 
> 
> > From: doug.meil@explorysmedical.com
> > To: user@hbase.apache.org
> > Date: Tue, 31 May 2011 10:22:34 -0400
> > Subject: RE: How to efficiently join HBase tables?
> > 
> > 
> > Re:  "The problem is that the few references to that question I found recommend
pulling one table to the mapper and then do a lookup for the referred row in the second table."
> > 
> > With multi-get in .90.x you could perform some reasonably clever processing and
not do the lookups one-by-one but in batches.
> > 
> > Also, if the other table is "small" you could have the leverage the block cache
on the lookups (i.e., if it's a domain/lookup table).  
> > 
> > 
> > 
> > -----Original Message-----
> > From: eran@gigya-inc.com [mailto:eran@gigya-inc.com] On Behalf Of Eran 
> > Kutner
> > Sent: Tuesday, May 31, 2011 8:06 AM
> > To: user@hbase.apache.org
> > Subject: How to efficiently join HBase tables?
> > 
> > Hi,
> > I need to join two HBase tables. The obvious way is to use a M/R job for that. The
problem is that the few references to that question I found recommend pulling one table to
the mapper and then do a lookup for the referred row in the second table.
> > This sounds like a very inefficient way to do  join with map reduce. I believe it
would be much better to feed the rows of both tables to the mapper and let it emit a key based
on the join fields. Since all the rows with the same join fields values will have the same
key the reducer will be able to easily generate the result of the join.
> > The problem with this is that I couldn't find a way to feed two tables to a single
map reduce job. I could probably dump the tables to files in a single directory and then run
the join on the files but that really makes no sense.
> > 
> > Am I missing something? Any other ideas?
> > 
> > -eran
>  		 	   		  
 		 	   		  
Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message