hbase-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Michael Segel <michael_se...@hotmail.com>
Subject RE: How to efficiently join HBase tables?
Date Tue, 31 May 2011 20:09:32 GMT


As I said... if you want to do relational database work, you should use a relational database.

The big problem with HBase is that outside of the key, you don't have indexes.
You asked a very general question and we have to assume a general case when we are looking
at creating a solution.
Since you don't have the ability to index your fields and join data, you have to assume that
you are going to be doing a full table scans.
So rather than go row by row and then doing full table scans, for each row found in one table,
the idea is to first get your data from each table separately.
I am again assuming that your subset from Table A is going to be a fraction of the data found
in table A and that your subset from table B is going to be a fraction of the data you're
getting from table B.

So after your initial filtered scan on each table, you could write them out to sequence files,
however, they won't be written in sort order without first doing a reduce job.
(Reduce job will take some time.)

Since you already have a database, writing them to a temp table would probably be faster.
(I guess you could write join column(s) | uuid to get uniqueness which would be smaller than
trying to cat all of the columns and still be written in 'sort order')

You mentioned that you are afraid about the time it takes for table splits. You could pre-split
the table to help alleviate this, but I'm presenting a white board solution that you need
to flesh out.

So now you have your temp table of millions of rows where the row keys are the join key with
a uuid appended to give uniqueness and its in some sort of sort order. (meaning that all of
the keys are grouped.

Now in the actual join, you're Iterating on one temp table, doing a limited scan of the second

Its pretty generic.



> From: eran@gigya.com
> Date: Tue, 31 May 2011 21:42:58 +0300
> Subject: Re: How to efficiently join HBase tables?
> To: user@hbase.apache.org
> Thanks everyone for the great feedback. I'll try to address all the
> suggestions.
> My data sets go between large and very large. One is in the order of many
> billions of rows, although the input for a typical MR job will be in the
> hundreds of millions, the second table is in the tens of millions. I doubt a
> SQL DB will handle this kind of a join in a reasonable manner.
> Doing batched lookup will indeed be more efficient than one by one but it
> will require the mapper to manage a local state between multiple calls,
> which is something I don't really like doing, and worse, it doesn't really
> solve the lookup problem it only moves it one tier lower. Instead of the
> mapper having to do all thouse random lookups, now HBase itself will have to
> do them. Granted it is more efficient than individual lookup API calls but
> it is not nearly as efficient as doing sequential reads.
> Finally, the temp table method, that will work but again, I suspect it will
> be a lot less efficient than the sequence files Hadoop would generate. The
> join output is expected to be in the tens of millions of rows, each with
> multiple columns. From some tests I've done, writing this number of rows to
> a clean table starts out very slowly and takes a lot of time to ramp up as
> the regions begin to split and move around the cluster. I should say that
> the output of this join is just the input for another MR job, so it would
> really be just a temp table and not something that would be useful after
> that.
> I should also say that I have looked into eliminating the lookup altogether
> by resolving the data from the second table before the rows are inserted to
> the main table, kind of denormalization, but that would introduce an
> unacceptable latency to a very high volume process.
> Still looking for other ideas.
> -eran
> On Tue, May 31, 2011 at 18:42, Doug Meil <doug.meil@explorysmedical.com>wrote:
> > Eran's observation was that a join is solvable in a Mapper via lookups on a
> > 2nd HBase table, but it might not be that efficient if the lookups are 1 by
> > 1.  I agree with that.
> >
> > My suggestion was to use multi-Get for the lookups instead.  So you'd hold
> > onto a batch of records in the Mapper and then the batch size is filled,
> > then you do the lookups (and then any required emitting, etc.).
> >
> >
> >
> > -----Original Message-----
> > From: Michael Segel [mailto:michael_segel@hotmail.com]
> > Sent: Tuesday, May 31, 2011 10:56 AM
> > To: user@hbase.apache.org
> > Subject: RE: How to efficiently join HBase tables?
> >
> >
> > Maybe I'm missing something... but this isn't a hard problem to solve.
> >
> > Eran wants to join two tables.
> > If we look at an SQL Statement...
> >
> > SELECT A.*, B.*
> > FROM A, B
> > WHERE A.1 = B.1
> > AND  A.2 = B.2
> > AND  A.3 = xxx
> > AND A.4 = yyy
> > AND B.45 = zzz
> >
> > Or something along those lines.
> >
> > So what you're essentially doing is saying I want to take a subset of data
> > from table A, and a subset of data from table B and join them on the values
> > in columns 1 and 2.
> > Table A's data will be filtered on columns 3 and 4 and B's data will be
> > filtered on column 45. NOTE: since you don't know the relationship of the
> > column names to either table, you're safer in writing tableA|column_name and
> > tableB|column_name to your temp table.
> >
> > So if you create a temp table FOO where the key is column 1 and column 2
> > (column1|column2) then when you walk through the subsets adding them to the
> > temp table, you will get the end result automatically.
> >
> > Then you can output your hbase temp table and then truncate the table.
> >
> > So what am I missing?
> >
> > -Mike
> >
> >
> > > From: doug.meil@explorysmedical.com
> > > To: user@hbase.apache.org
> > > Date: Tue, 31 May 2011 10:22:34 -0400
> > > Subject: RE: How to efficiently join HBase tables?
> > >
> > >
> > > Re:  "The problem is that the few references to that question I found
> > recommend pulling one table to the mapper and then do a lookup for the
> > referred row in the second table."
> > >
> > > With multi-get in .90.x you could perform some reasonably clever
> > processing and not do the lookups one-by-one but in batches.
> > >
> > > Also, if the other table is "small" you could have the leverage the block
> > cache on the lookups (i.e., if it's a domain/lookup table).
> > >
> > >
> > >
> > > -----Original Message-----
> > > From: eran@gigya-inc.com [mailto:eran@gigya-inc.com] On Behalf Of Eran
> > > Kutner
> > > Sent: Tuesday, May 31, 2011 8:06 AM
> > > To: user@hbase.apache.org
> > > Subject: How to efficiently join HBase tables?
> > >
> > > Hi,
> > > I need to join two HBase tables. The obvious way is to use a M/R job for
> > that. The problem is that the few references to that question I found
> > recommend pulling one table to the mapper and then do a lookup for the
> > referred row in the second table.
> > > This sounds like a very inefficient way to do  join with map reduce. I
> > believe it would be much better to feed the rows of both tables to the
> > mapper and let it emit a key based on the join fields. Since all the rows
> > with the same join fields values will have the same key the reducer will be
> > able to easily generate the result of the join.
> > > The problem with this is that I couldn't find a way to feed two tables to
> > a single map reduce job. I could probably dump the tables to files in a
> > single directory and then run the join on the files but that really makes no
> > sense.
> > >
> > > Am I missing something? Any other ideas?
> > >
> > > -eran
> >
> >
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message