hbase-user mailing list archives

From Ted Dunning <tdunn...@maprtech.com>
Subject Re: How to efficiently join HBase tables?
Date Tue, 31 May 2011 20:10:54 GMT
Your mapper can tell which file is being read and add source tags to the
data records.

The reducer can then do the Cartesian product (if you really need that).
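
A minimal sketch of that tag-then-group idea, simulated in plain Python
rather than written against the Hadoop or HBase APIs. Every name here
(map_record, shuffle, cogroup, reduce_join, join_tables, max_per_side) is
hypothetical and only for illustration, and truncating each side to
max_per_side is just a crude stand-in for whatever down-sampling you want.

```python
from collections import defaultdict
from itertools import product


def map_record(source, row, join_fields):
    """Mapper: emit (join key, (source tag, row)) for one input row."""
    key = tuple(row[f] for f in join_fields)
    return key, (source, row)


def shuffle(pairs):
    """Stand-in for the MR shuffle: group mapper output by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def cogroup(tagged_values, left_tag, right_tag):
    """Split one key's values by source tag, stopping before the product."""
    left = [row for tag, row in tagged_values if tag == left_tag]
    right = [row for tag, row in tagged_values if tag == right_tag]
    return left, right


def reduce_join(tagged_values, left_tag, right_tag, max_per_side=None):
    """Reducer: co-group, optionally cap each side, then emit the product."""
    left, right = cogroup(tagged_values, left_tag, right_tag)
    if max_per_side is not None:
        # Assumption: simple truncation as a placeholder for down-sampling.
        left, right = left[:max_per_side], right[:max_per_side]
    return [{**l, **r} for l, r in product(left, right)]


def join_tables(left_rows, right_rows, join_fields, max_per_side=None):
    """Run the map, shuffle, and reduce steps in-process over two row lists."""
    pairs = [map_record("left", r, join_fields) for r in left_rows]
    pairs += [map_record("right", r, join_fields) for r in right_rows]
    joined = []
    for tagged in shuffle(pairs).values():
        joined.extend(reduce_join(tagged, "left", "right", max_per_side))
    return joined
```

If you only need the co-grouped lists, call cogroup in the reducer and skip
the product step entirely.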

On Tue, May 31, 2011 at 12:19 PM, Eran Kutner <eran@gigya.com> wrote:

> For my purposes I don't really need the general case, but even if I did I
> think it can probably be done more simply.
> The main problem is getting the data from both tables into the same MR job
> without resorting to lookups. So without the theoretical
> MultiTableInputFormat, I could just copy all the data from both tables into
> a temp table, appending the source table name to the row keys to make sure
> there are no conflicts. Once all the data from both tables is in the same
> temp table, run an MR job. For each row the mapper should emit a key
> composed of the values of all the join fields in that row (the value can be
> emitted as is). This will cause all the rows from both tables with the same
> join field values to arrive at the reducer together. The reducer could then
> iterate over them and produce the Cartesian product as needed.
> I still don't like having to copy all the data into a temp table just
> because I can't feed two tables into the MR job.
> As Jason Rutherglen mentioned above, Hive can do joins. I don't know if it
> can do them for HBase, and it will not suit my needs, but it would be
> interesting to know how it does them, if anyone knows.
> -eran
> On Tue, May 31, 2011 at 22:02, Ted Dunning <tdunning@maprtech.com> wrote:
> > The Cartesian product often makes an honest-to-god join not such a good
> > idea on large data.  The common alternative is co-group, which is
> > basically like doing the hard work of the join, but involves stopping
> > just before emitting the Cartesian product.  This allows you to inject
> > whatever cleverness you need at this point.
> >
> > Common kinds of cleverness include down-sampling of problematically large
> > sets of candidates.
> >
> > On Tue, May 31, 2011 at 11:56 AM, Michael Segel
> > <michael_segel@hotmail.com> wrote:
> >
> > > So the underlying problem that the OP was trying to solve was how to
> > > join two tables from HBase.
> > > Unfortunately I goofed.
> > > I gave a quick and dirty solution that is a bit incomplete. The row
> > > key in the temp table has to be unique, and I forgot about the
> > > Cartesian product. So my solution wouldn't work in the general case.
> > >
> >
