hbase-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Michel Segel <michael_se...@hotmail.com>
Subject Re: How to efficiently join HBase tables?
Date Thu, 09 Jun 2011 12:02:16 GMT
Doug,
I think I should clarify something...

Yes I am the only one who is saying get() won't work. 
The question was asked on how to do an efficient join where there were no specific parameters
like joining on key values. It wasn't until yesterday that Eran gave an example of the specific
problem...

So you have to make some assumptions that you are attempting to solve this problem in the
general case. That is you cant assume that you are joining on the row keys. Since we are talking
about a big data problem you can assume that your data sets are going to be huge. 
So you have to consider how many rows you can store in memory and that you may not have enough
memory.

Because most who are moving to a NoSQL database only have been exposed to relational models,
they will approach HBase from a relational schema design. So solving the question of how to
join two tables efficiently in general terms has a lot of value to the community as a whole.


David is right in that I'm looking back and taking the approach of how joining two data sets.
It's not rocket science and this problem has been solved under different paradigms.

Sent from a remote device. Please excuse any typos...

Mike Segel

On Jun 8, 2011, at 9:56 PM, Doug Meil <doug.meil@explorysmedical.com> wrote:

> Hi there-
> 
> Summary comment:
> 
> 1)  Preference
> 
> Several people in this thread have suggested approaches (map-side memory join, multi-get,
temp files), all of which have merit and have advantages in certain situations.  Kudos to
the dist-list for chiming in.  The "right" approach depends on the specific problem you are
trying to solve, and that's preference.
> 
> 2)   Possibility
> 
> Mike, you're the only one in this thread arguing that one of the approaches isn't possible.
 And you seem to be arguing that it's never possible to look up a record in HBase with a Get,
but others don't seem to have this problem.
> 
> I wish everybody success with their HBase join research, but I'm checking out of this
thread (ka-ching).
> 
> 
> -----Original Message-----
> From: im_gumby@hotmail.com [mailto:im_gumby@hotmail.com] On Behalf Of Michel Segel
> Sent: Wednesday, June 08, 2011 10:14 AM
> To: user@hbase.apache.org
> Subject: Re: How to efficiently join HBase tables?
> 
> Unless I am mistaken... get() requires a row key, right?
> And you can join tables on column data which isn't in the row key, right?
> 
> So how do you do a get()? :-)
> 
> Sure there is more than one way to skin a cat. But if you want to be efficient... You
will create a set of unique keys based on the columns that you want to join. Note that if
you are going to use a temp table in hbase, you will want to store the unique key value A|B
and when you write the row to the temp table, you will append an unique identifier like a
uuid so that you don't lose the row.
> 
> Here your input list to the actual join is going to be the list of unique keys and then
you do a scan to get the rows.
> 
> Again, I could be wrong but how can you perform a get() when you only know a portion
of the row key?
> 
> 
> 
> Sent from a remote device. Please excuse any typos...
> 
> Mike Segel
> 
> On Jun 8, 2011, at 8:01 AM, Doug Meil <doug.meil@explorysmedical.com> wrote:
> 
>> 
>> Re: " With respect to Doug's posts, you can't do a multi-get off the bat"
>> 
>> That's an assumption, but you're entitled to your opinion.
>> 
>> -----Original Message-----
>> From: Michael Segel [mailto:michael_segel@hotmail.com]
>> Sent: Monday, June 06, 2011 10:08 PM
>> To: user@hbase.apache.org
>> Subject: RE: How to efficiently join HBase tables?
>> 
>> 
>> Well....
>> 
>> David, is correct.
>> 
>> Eran wanted to do a join which is a relational concept that isn't natively supported
by a NoSQL database. A better model would be a hierarchical model like Dick Pick's Revelation.
(Univers aka U2 from Ardent/Informix/IBM/now JRockit?).
>> And yes, we're looking back 40 some odd years in to either a
>> merge/sort solution or how databases do a relational join. :-)
>> 
>> Eran wants to do this in a single m/r job. The short answer is you
>> can't.  Longer answer is that if your main class implements Tool
>> Runner, you can launch two jobs in parallel to get your subsets, and
>> then when they both complete, you run the join job on them. So I guess
>> its a single 'job' or rather app. :-)
>> 
>> With respect to Doug's posts, you can't do a multi-get off the bat
>> because in the general case you're not fetching based on the row key
>> but a column which is not part of the row key. (It could be a foreign
>> key which would mean that at least one of your table fetches will be
>> off the row key but you can't guarantee it.)
>> 
>> So if you don't want to use temp tables, then you have to put your
>> results in a sorted order, and you still want to get the unique set of
>> the join-keys which means you have to run a reduce job. Then you can
>> use the unique key set and then do the scans. (You can't do a
>> multi-get because you're doing a scan with a start and stop row(s).)
>> 
>> The reason I suggest that if you're going to do a join operation, you want to use
temp tables because it makes your life easier and probably faster too.
>> 
>> Bottom line... I guess many data architects are going to need rethink
>> their data models when working on big data. :-)
>> 
>> -Mike
>> 
>> PS. If I get a spare moment, I may code this up...
>> 
>> 
>>> From: doug.meil@explorysmedical.com
>>> To: user@hbase.apache.org
>>> Date: Mon, 6 Jun 2011 17:19:44 -0400
>>> Subject: RE: How to efficiently join HBase tables?
>>> 
>>> Re:  " So, you all realize the joins have been talked about in the database community
for 40 years?"
>>> 
>>> Great point.  What's old is new!    :-)
>>> 
>>> My suggested from earlier in the thread was a variant of nested loops by using
multi-get in HTable, which would reduce the number of RPC calls.  So it's a "bulk-select nested
loops" of sorts (i.e., as opposed to the 1-by-1 lookup of regular nested loops).
>>> 
>>> 
>>> -----Original Message-----
>>> From: Buttler, David [mailto:buttler1@llnl.gov]
>>> Sent: Monday, June 06, 2011 4:30 PM
>>> To: user@hbase.apache.org
>>> Subject: RE: How to efficiently join HBase tables?
>>> 
>>> So, you all realize the joins have been talked about in the database community
for 40 years?  There are two main types of joins:
>>> Nested loops
>>> Hash table
>>> 
>>> Mike, in his various emails seems to be trying to re-imagine how to implement
both types of joins in HBase (which seems like a reasonable goal). I am not exactly sure what
Eran is going for here, but it seems like Eran is glossing over a piece.  If you have two
scanners for table A and B, then table B needs to be rescanned for every unique part of the
join condition in table A.  There are certain ways of improving the efficiency of that: the
two most obvious are pushing the selection criteria down to the scans, and scanning all of
the same join values from table B at the same time (which requires that Table B's key is the
join, or a secondary structure that stores the join values as the primary order).
>>> 
>>> Dave
>>> 
>>> -----Original Message-----
>>> From: eran@gigya-inc.com [mailto:eran@gigya-inc.com] On Behalf Of
>>> Eran Kutner
>>> Sent: Friday, June 03, 2011 12:24 AM
>>> To: user@hbase.apache.org
>>> Subject: Re: How to efficiently join HBase tables?
>>> 
>>> Mike, this more or less what I tried to  describe in my initial post, only you
explained it much better.
>>> The problem is that I want to do all of this in one M/R run, not 3 and without
explicit temp tables. If there was only a way to feed both table A and table B into the M/R
job then it could be done.
>>> 
>>> Let's take your query and assumptions, for example.
>>> So we configure scanner A to return rows where c=xxx and d=yyy We then configure
scanner B to return rows where e=zzz Now we feed all those rows to the mapper.
>>> For each row the mapper gets it outputs a new key which is "a|b" and the same
value it received, if either one doesn't exist in the row the mapper doesn't output anything
for that row.
>>> The is an implicit "temp table" created at this stage by hadoop.
>>> Now the reducer is run, for every key "a|b" generated by the mapper it would
get one or more value sets, each one representing a row from the original two tables. For
simplicity lets assume we got two rows, one from table A the other from table B. Now the reducer
can combine the two rows and output the combined row. This will work just the same if there
were multiple rows from each table with the same "a|b" key, in that case the reducer would
have to generate the Cartesian product of all the rows. Outer joins can also be done this
way, in an outer join you only get one row in the reducer for a given "a|b" key but still
generate an output.
>>> 
>>> -eran
>>> 
>>> 
>>> 
>>> On Fri, Jun 3, 2011 at 00:05, Michael Segel <michael_segel@hotmail.com>wrote:
>>> 
>>>> 
>>>> Not to beat a dead horse, but I thought a bit more about the problem.
>>>> If you want to do this all in HBase using a M/R job...
>>>> 
>>>> Lets define the following:
>>>> SELECT *
>>>> FROM A, B
>>>> WHERE A.a = B.a
>>>> AND     A.b = B.b
>>>> AND     A.c = xxx
>>>> AND     A.d = yyy
>>>> AND     B.e = zzz
>>>> 
>>>> Is the sample query.
>>>> 
>>>> So our join key is "a|b" because we're matching on columns a and b.
>>>> (The pipe is to delimit the columns, assuming the columns themselves
>>>> don't contain pipes...)
>>>> 
>>>> Our filters on A are c and d while e is the filter on B.
>>>> 
>>>> So we want to do the following:
>>>> 
>>>> M/R Map job 1 gets the subset from table A along with a set of unique keys.
>>>> M/R Map job 2 gets the subset from table B along with a set of unique keys.
>>>> M/R Map job 3 takes either set of unique keys as the input list and
>>>> you split it based on the number of parallel mappers you want to use.
>>>> 
>>>> You have a couple of options on how you want to proceed.
>>>> In each Mapper.map() your input is a unique key.
>>>> I guess you could create two scanners, one for tempTableA, and one
>>>> for tempTableB.
>>>> It looks like you can get the iterator for each result set, and then
>>>> for each row in temp table A, you iterate through the result set
>>>> from temp table B, writing out the joined set.
>>>> 
>>>> The only problem is that your result set file isn't in sort order.
>>>> So I guess you could take the output from this job and reduce it to
>>>> get it in to sort order.
>>>> 
>>>> Option B. Using HDFS files for temp 'tables'.
>>>> You can do this... but you would still have to track the unique keys
>>>> and also sort both the keys and the files which will require a reduce job.
>>>> 
>>>> 
>>>> Now this is just my opinion, but if I use HBase, I don't have to
>>>> worry about using a reducer except to order the final output set.
>>>> So I can save the time it takes to do the reduce step. So I have to ask...
>>>> how much time is spent by HBase in splitting and compacting the temp tables?
>>>> Also can't you pre-split the temp table before you use them?
>>>> 
>>>> Or am I still missing something?
>>>> 
>>>> Note: In this example, you'd have to write an input format that
>>>> takes a java list object (or something similar) as your input and
>>>> then you can split it to get it to run in parallel.
>>>> Or you could just write this on the client and split the list up and
>>>> run the join in parallel threads on the client node. Or a single
>>>> thread which would mean that it would run and output in sort order.
>>>> 
>>>> HTH
>>>> 
>>>> -Mike
>>>> 
>>>>> Date: Wed, 1 Jun 2011 07:47:30 -0700
>>>>> Subject: Re: How to efficiently join HBase tables?
>>>>> From: jason.rutherglen@gmail.com
>>>>> To: user@hbase.apache.org
>>>>> 
>>>>>> you somehow need to flush all in-memory data *and* perform a major
>>>>>> compaction
>>>>> 
>>>>> This makes sense.  Without compaction the linear HDFS scan isn't
>>>>> possible.  I suppose one could compact 'offline' in a different Map
>>>>> Reduce job.  However that would have it's own issues.
>>>>> 
>>>>>> The files do have a flag if they were made by a major compaction,
>>>>>> so you scan only those and ignore the newer ones - but then you
>>>>>> are
>>>> trailing
>>>>> 
>>>>> This could be ok in many cases.  The key would be to create a
>>>>> sync'd cut off point enabling a frozen point-in-time 'view' of the data.
>>>>> I'm not sure how that would be implemented.
>>>>> 
>>>>> On Wed, Jun 1, 2011 at 6:54 AM, Lars George <lars.george@gmail.com>
>>>> wrote:
>>>>>> Hi Jason,
>>>>>> 
>>>>>> This was discussed in the past, using the HFileInputFormat. The
>>>>>> issue is that you somehow need to flush all in-memory data *and*
>>>>>> perform a major compaction - or else you would need all the logic
>>>>>> of the ColumnTracker in the HFIF. Since that needs to scan all
>>>>>> storage files in parallel to achieve its job, the MR task would
>>>>>> not really be able to use the same approach.
>>>>>> 
>>>>>> Running a major compaction creates a lot of churn, so it is
>>>>>> questionable what the outcome is. The files do have a flag if they
>>>>>> were made by a major compaction, so you scan only those and ignore
>>>>>> the newer ones - but then you are trailing, and you still do not
>>>>>> handle delete markers/updates in newer files. No easy feat.
>>>>>> 
>>>>>> Lars
>>>>>> 
>>>>>> On Wed, Jun 1, 2011 at 2:41 AM, Jason Rutherglen
>>>>>> <jason.rutherglen@gmail.com> wrote:
>>>>>>>> I'd imagine that join operations do not require realtime-ness,
>>>>>>>> and so faster batch jobs using Hive -> frozen HBase files
in
>>>>>>>> HDFS could be the optimal way to go?
>>>>>>> 
>>>>>>> In addition to lessening the load on the perhaps live RegionServer.
>>>>>>> There's no Jira for this, I'm tempted to open one.
>>>>>>> 
>>>>>>> On Tue, May 31, 2011 at 5:18 PM, Jason Rutherglen
>>>>>>> <jason.rutherglen@gmail.com> wrote:
>>>>>>>>> The Hive-HBase integration allows you to create Hive
tables
>>>>>>>>> that are
>>>> backed
>>>>>>>>> by HBase
>>>>>>>> 
>>>>>>>> In addition, HBase can be made to go faster for MapReduce
jobs,
>>>>>>>> if
>>>> the
>>>>>>>> HFile's could be used directly in HDFS, rather than proxying
>>>>>>>> through the RegionServer.
>>>>>>>> 
>>>>>>>> I'd imagine that join operations do not require realtime-ness,
>>>>>>>> and so faster batch jobs using Hive -> frozen HBase files
in
>>>>>>>> HDFS could be the optimal way to go?
>>>>>>>> 
>>>>>>>> On Tue, May 31, 2011 at 1:41 PM, Patrick Angeles <
>>>> patrick@cloudera.com> wrote:
>>>>>>>>> On Tue, May 31, 2011 at 3:19 PM, Eran Kutner <eran@gigya.com>
>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> For my need I don't really need the general case,
but even if
>>>>>>>>>> I did
>>>> I think
>>>>>>>>>> it can probably be done simpler.
>>>>>>>>>> The main problem is getting the data from both tables
into the
>>>>>>>>>> same
>>>> MR job,
>>>>>>>>>> without resorting to lookups. So without the theoretical
>>>>>>>>>> MutliTableInputFormat, I could just copy all the
data from
>>>>>>>>>> both
>>>> tables into
>>>>>>>>>> a temp table, just append the source table name to
the row
>>>>>>>>>> keys to
>>>> make
>>>>>>>>>> sure
>>>>>>>>>> there are no conflicts. When all the data from both
tables is
>>>>>>>>>> in
>>>> the same
>>>>>>>>>> temp table, run a MR job. For each row the mapper
should emit
>>>>>>>>>> a key
>>>> which
>>>>>>>>>> is
>>>>>>>>>> composed of all the values of the join fields in
that row (the
>>>> value can be
>>>>>>>>>> emitted as is). This will cause all the rows from
both tables,
>>>>>>>>>> with
>>>> same
>>>>>>>>>> join field values to arrive at the reducer together.
The
>>>>>>>>>> reducer
>>>> could then
>>>>>>>>>> iterate over them and produce the Cartesian product
as needed.
>>>>>>>>>> 
>>>>>>>>>> I still don't like having to copy all the data into
a temp
>>>>>>>>>> table
>>>> just
>>>>>>>>>> because I can't feed two tables into the MR job.
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Loading the smaller table in memory is called a map join,
>>>>>>>>> versus a reduce-side join (a.k.a. common join). One reason
to
>>>>>>>>> prefer a map
>>>> join is
>>>>>>>>> you avoid the shuffle phase which potentially involves
several
>>>>>>>>> trips
>>>> to disk
>>>>>>>>> for the intermediate records due to spills, and also
once
>>>>>>>>> through
>>>> the
>>>>>>>>> network to get each intermediate KV pair to the right
reducer.
>>>>>>>>> With
>>>> a map
>>>>>>>>> join, everything is local, except for the part where
you load
>>>>>>>>> the
>>>> small
>>>>>>>>> table.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> As Jason Rutherglen mentioned above, Hive can do
joins. I
>>>>>>>>>> don't
>>>> know if it
>>>>>>>>>> can do them for HBase and it will not suit my needs,
but it
>>>>>>>>>> would
>>>> be
>>>>>>>>>> interesting to know how is it doing them, if anyone
knows.
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> The Hive-HBase integration allows you to create Hive
tables
>>>>>>>>> that are
>>>> backed
>>>>>>>>> by HBase. You can do joins on those tables (and also
with
>>>>>>>>> standard
>>>> Hive
>>>>>>>>> tables). It might be worth trying out in your case as
it lets
>>>>>>>>> you
>>>> easily see
>>>>>>>>> the load characteristics and the job runtime without
much
>>>>>>>>> coding
>>>> investment.
>>>>>>>>> 
>>>>>>>>> There are probably some specific optimizations that can
be
>>>>>>>>> applied
>>>> to your
>>>>>>>>> situation, but it's hard to say without knowing your
use-case.
>>>>>>>>> 
>>>>>>>>> Regards,
>>>>>>>>> 
>>>>>>>>> - Patrick
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> -eran
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Tue, May 31, 2011 at 22:02, Ted Dunning
>>>>>>>>>> <tdunning@maprtech.com>
>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> The Cartesian product often makes an honest-to-god
join not
>>>>>>>>>>> such
>>>> a good
>>>>>>>>>>> idea
>>>>>>>>>>> on large data.  The common alternative is co-group
which is
>>>>>>>>>>> basically like doing the hard work of the join,
but
>>>> involves
>>>>>>>>>>> stopping just before emitting the cartesian product.
 This
>>>>>>>>>>> allows you to inject whatever cleverness you
need at this point.
>>>>>>>>>>> 
>>>>>>>>>>> Common kinds of cleverness include down-sampling
of
>>>> problematically large
>>>>>>>>>>> sets of candidates.
>>>>>>>>>>> 
>>>>>>>>>>> On Tue, May 31, 2011 at 11:56 AM, Michael Segel
>>>>>>>>>>> <michael_segel@hotmail.com>wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> So the underlying problem that the OP was
trying to solve
>>>>>>>>>>>> was
>>>> how to
>>>>>>>>>> join
>>>>>>>>>>>> two tables from HBase.
>>>>>>>>>>>> Unfortunately I goofed.
>>>>>>>>>>>> I gave a quick and dirty solution that is
a bit incomplete.
>>>> They row
>>>>>>>>>> key
>>>>>>>>>>> in
>>>>>>>>>>>> the temp table has to be unique and I forgot
about the
>>>> Cartesian
>>>>>>>>>>>> product. So my solution wouldn't work in
the general case.
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>> 
>>>> 
>> 
>> 
> 

Mime
View raw message