hbase-user mailing list archives

From Michel Segel <michael_se...@hotmail.com>
Subject Re: How to efficiently join HBase tables?
Date Thu, 09 Jun 2011 12:09:03 GMT
>> This sounds effective to me, so long as you can perform any desired
>> operations on all the rows matching a single join value via a single
>> iteration through the stream of reduce input values (for example, if the
>> set of data for each join value fits in memory). Otherwise you'd need to
>> put the list of matches from table A some place that you can iterate over
>> it again for each match in table B.


This is why I was suggesting using the temp tables and not trying to do it as a single map reduce
job. When your data sets get very large you will have problems...

;-)

Sent from a remote device. Please excuse any typos...

Mike Segel

On Jun 9, 2011, at 4:35 AM, Eran Kutner <eran@gigya.com> wrote:

> Exactly!
> Thanks Dave for a much better explanation than mine!
> 
> -eran
> 
> 
> 
> On Thu, Jun 9, 2011 at 00:35, Dave Latham <latham@davelink.net> wrote:
> 
>> I believe this is what Eran is suggesting:
>> 
>> Table A
>> -------
>> Row1 (has joinVal_1)
>> Row2 (has joinVal_2)
>> Row3 (has joinVal_1)
>> 
>> Table B
>> -------
>> Row4 (has joinVal_1)
>> Row5 (has joinVal_3)
>> Row6 (has joinVal_2)
>> 
>> Mapper receives a list of input rows (union of both input tables in any
>> order), and produces (=>) intermediate key, value pairs, where the key is
>> the join field, and the value is whatever portion of the row you want
>> available in your output
>> 
>> Map
>> ----------
>> A, Row1 => (joinVal_1, [A,Row1])
>> A, Row2 => (joinVal_2, [A,Row2])
>> A, Row3 => (joinVal_1, [A,Row3])
>> B, Row4 => (joinVal_1, [B,Row4])
>> B, Row5 => (joinVal_3, [B,Row5])
>> B, Row6 => (joinVal_2, [B,Row6])
>> 
>> The shuffle phase partitions and sorts by the map output key (which is the
>> join value). The reduce phase then gets a key for the join value and a list
>> of values containing all of the input rows (from either table) with that
>> join value. It can then perform whatever operations you want (like
>> enumerate the subset of the Cartesian product for that join value).
>> 
>> Reduce
>> ------------
>> joinVal_1, {[A,Row1], [A,Row3], [B,Row4]} => Row1 x Row4, Row3 x Row4
>> joinVal_2, {[A,Row2], [B,Row6]} => Row2 x Row6
>> joinVal_3, {[B,Row5]} => {}
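>> 
>> In code, an untested sketch of that job could look like the following
>> (plain Hadoop, simulating the union of the two tables as text records of
>> the form "table,rowKey,joinVal", since feeding two HBase tables into one
>> job is the open question in this thread; all names are made up):
>> 
>> import java.io.IOException;
>> import java.util.ArrayList;
>> import java.util.List;
>> import org.apache.hadoop.io.LongWritable;
>> import org.apache.hadoop.io.Text;
>> import org.apache.hadoop.mapreduce.Mapper;
>> import org.apache.hadoop.mapreduce.Reducer;
>> 
>> public class ReduceSideJoin {
>>   public static class JoinMapper
>>       extends Mapper<LongWritable, Text, Text, Text> {
>>     protected void map(LongWritable offset, Text line, Context ctx)
>>         throws IOException, InterruptedException {
>>       // "A,Row1,joinVal_1" => key joinVal_1, value "A,Row1"
>>       String[] p = line.toString().split(",", 3);
>>       ctx.write(new Text(p[2]), new Text(p[0] + "," + p[1]));
>>     }
>>   }
>> 
>>   public static class JoinReducer
>>       extends Reducer<Text, Text, Text, Text> {
>>     protected void reduce(Text joinVal, Iterable<Text> rows, Context ctx)
>>         throws IOException, InterruptedException {
>>       // Buffer the rows per source table - assumes each join value's
>>       // rows fit in memory, which is exactly the caveat quoted below.
>>       List<String> a = new ArrayList<String>();
>>       List<String> b = new ArrayList<String>();
>>       for (Text t : rows) {
>>         if (t.toString().startsWith("A,")) a.add(t.toString());
>>         else b.add(t.toString());
>>       }
>>       for (String ra : a)      // enumerate the Cartesian product
>>         for (String rb : b)    // for this join value (inner join)
>>           ctx.write(joinVal, new Text(ra + " x " + rb));
>>     }
>>   }
>> }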
>> 
>>>> This sounds effective to me, so long as you can perform any desired
>>>> operations on all the rows matching a single join value via a single
>>>> iteration through the stream of reduce input values (for example, if the
>>>> set of data for each join value fits in memory). Otherwise you'd need to
>>>> put the list of matches from table A some place that you can iterate over
>>>> it again for each match in table B.
>> 
>> Dave
>> 
>> On Wed, Jun 8, 2011 at 1:45 PM, Buttler, David <buttler1@llnl.gov> wrote:
>> 
>>> Let's make a toy example to see if we can capture all of the edge
>>> conditions:
>>> Table A
>>> -------
>>> Key1 joinVal_1
>>> Key2 joinVal_2
>>> Key3 joinVal_1
>>> 
>>> Table B
>>> -------
>>> Key4 joinVal_1
>>> Key5 joinVal_3
>>> Key6 joinVal_2
>>> 
>>> Now, assume that we have a mapper that takes two values, one row from A,
>>> and one row from B. Are you suggesting that we get the following map
>>> calls:
>>> Key1 & key4
>>> Key2 & key5
>>> Key3 & key6
>>> 
>>> Or are you suggesting we get the following:
>>> Key1 & key4
>>> Key1 & key5
>>> Key1 & key6
>>> Key2 & key4
>>> Key2 & key5
>>> Key2 & key6
>>> Key3 & key4
>>> Key3 & key5
>>> Key3 & key6
>>> 
>>> Or are you suggesting something different?
>>> 
>>> Dave
>>> 
>>> -----Original Message-----
>>> From: eran@gigya-inc.com [mailto:eran@gigya-inc.com] On Behalf Of Eran Kutner
>>> Sent: Wednesday, June 08, 2011 11:47 AM
>>> To: user@hbase.apache.org
>>> Subject: Re: How to efficiently join HBase tables?
>>> 
>>> I'd like to clarify, again, what I'm trying to do and why I still think
>>> it's the best way to do it.
>>> I want to join two large tables. I'm assuming, and this is the key to the
>>> efficiency of this method, that: 1) I'm getting a lot of data from table
>>> A, something which is close enough to a full table scan, and 2) this
>>> implies that I will need to join with most of table B as well.
>>> All the suggestions from the SQL world are doing lookups, one way or
>>> another, in table B. My suggestion is to use the power of the shuffle
>>> phase to do the join. It is obviously doable, so I don't understand the
>>> statement that it can't be done.
>>> So to go over it again:
>>> 1. You feed all the rows from table A and B into the mapper.
>>> 2. For each row, the mapper should output a new row with a key
>>> constructed from the join fields and a value which is the row itself
>>> (same as the input value it got).
>>> 3. The shuffle phase will make sure all rows with the same values in the
>>> join fields will end up together.
>>> 4. The reducer will get all the rows for a single set of join field
>>> values together and perform the actual join. The reducer can be
>>> programmed to do an inner or outer join at this point.
>>> 
>>> I can't prove it without actually writing and testing it, but I have a
>>> strong feeling this will be much more efficient for large joins than any
>>> form of lookup.
>>> 
>>> -eran
>>> 
>>> 
>>> 
>>> On Wed, Jun 8, 2011 at 16:01, Doug Meil <doug.meil@explorysmedical.com> wrote:
>>> 
>>>> 
>>>> Re: "With respect to Doug's posts, you can't do a multi-get off the bat"
>>>> 
>>>> That's an assumption, but you're entitled to your opinion.
>>>> 
>>>> -----Original Message-----
>>>> From: Michael Segel [mailto:michael_segel@hotmail.com]
>>>> Sent: Monday, June 06, 2011 10:08 PM
>>>> To: user@hbase.apache.org
>>>> Subject: RE: How to efficiently join HBase tables?
>>>> 
>>>> 
>>>> Well....
>>>> 
>>>> David is correct.
>>>> 
>>>> Eran wanted to do a join, which is a relational concept that isn't
>>>> natively supported by a NoSQL database. A better model would be a
>>>> hierarchical model like Dick Pick's Revelation. (UniVerse, aka U2, from
>>>> Ardent/Informix/IBM/now Rocket Software?)
>>>> And yes, we're looking back some 40-odd years into either a merge/sort
>>>> solution or how databases do a relational join. :-)
>>>> 
>>>> Eran wants to do this in a single m/r job. The short answer is you
>>>> can't. The longer answer is that if your main class implements Tool and
>>>> runs via ToolRunner, you can launch two jobs in parallel to get your
>>>> subsets, and then when they both complete, you run the join job on them.
>>>> So I guess it's a single 'job', or rather app. :-)
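>>>> 
>>>> Roughly, the driver would look like this (untested sketch; the three
>>>> jobs' input/output wiring is omitted and all names are made up):
>>>> 
>>>> import org.apache.hadoop.conf.Configured;
>>>> import org.apache.hadoop.mapreduce.Job;
>>>> import org.apache.hadoop.util.Tool;
>>>> import org.apache.hadoop.util.ToolRunner;
>>>> 
>>>> public class JoinDriver extends Configured implements Tool {
>>>>   public int run(String[] args) throws Exception {
>>>>     Job subsetA = new Job(getConf(), "subset-table-A");
>>>>     Job subsetB = new Job(getConf(), "subset-table-B");
>>>>     // ... set scans, mappers and temp-table outputs here ...
>>>>     subsetA.submit();    // submit() returns immediately,
>>>>     subsetB.submit();    // so the two subset jobs run in parallel
>>>>     while (!subsetA.isComplete() || !subsetB.isComplete()) {
>>>>       Thread.sleep(5000);   // poll until both finish
>>>>     }
>>>>     Job join = new Job(getConf(), "join-A-B");
>>>>     // ... join job reads the two temp tables ...
>>>>     return join.waitForCompletion(true) ? 0 : 1;
>>>>   }
>>>> 
>>>>   public static void main(String[] args) throws Exception {
>>>>     System.exit(ToolRunner.run(new JoinDriver(), args));
>>>>   }
>>>> }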
>>>> 
>>>> With respect to Doug's posts, you can't do a multi-get off the bat
>>>> because in the general case you're not fetching based on the row key but
>>>> on a column which is not part of the row key. (It could be a foreign
>>>> key, which would mean that at least one of your table fetches will be
>>>> off the row key, but you can't guarantee it.)
>>>> 
>>>> So if you don't want to use temp tables, then you have to put your
>>>> results in a sorted order, and you still want to get the unique set of
>>>> the join-keys, which means you have to run a reduce job. Then you can
>>>> use the unique key set and do the scans. (You can't do a multi-get
>>>> because you're doing a scan with start and stop rows.)
>>>> 
>>>> The reason I suggest using temp tables if you're going to do a join
>>>> operation is that they make your life easier and are probably faster
>>>> too.
>>>> 
>>>> Bottom line... I guess many data architects are going to need to rethink
>>>> their data models when working on big data. :-)
>>>> 
>>>> -Mike
>>>> 
>>>> PS. If I get a spare moment, I may code this up...
>>>> 
>>>> 
>>>>> From: doug.meil@explorysmedical.com
>>>>> To: user@hbase.apache.org
>>>>> Date: Mon, 6 Jun 2011 17:19:44 -0400
>>>>> Subject: RE: How to efficiently join HBase tables?
>>>>> 
>>>>> Re: "So, you all realize that joins have been talked about in the
>>>>> database community for 40 years?"
>>>>> 
>>>>> Great point.  What's old is new!    :-)
>>>>> 
>>>>> My suggestion from earlier in the thread was a variant of nested loops
>>>>> by using multi-get in HTable, which would reduce the number of RPC
>>>>> calls. So it's a "bulk-select nested loops" of sorts (i.e., as opposed
>>>>> to the 1-by-1 lookup of regular nested loops).
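>>>>> 
>>>>> For instance, something like this (untested; it assumes the join value
>>>>> happens to be table B's row key, and the class/method names are made
>>>>> up):
>>>>> 
>>>>> import java.io.IOException;
>>>>> import java.util.ArrayList;
>>>>> import java.util.List;
>>>>> import org.apache.hadoop.conf.Configuration;
>>>>> import org.apache.hadoop.hbase.client.Get;
>>>>> import org.apache.hadoop.hbase.client.HTable;
>>>>> import org.apache.hadoop.hbase.client.Result;
>>>>> 
>>>>> public class BulkSelect {
>>>>>   // One batched RPC round for a whole block of join keys collected
>>>>>   // from table A, instead of one Get per key.
>>>>>   static Result[] bulkLookup(Configuration conf, List<byte[]> joinKeys)
>>>>>       throws IOException {
>>>>>     List<Get> gets = new ArrayList<Get>(joinKeys.size());
>>>>>     for (byte[] key : joinKeys) {
>>>>>       gets.add(new Get(key));
>>>>>     }
>>>>>     HTable tableB = new HTable(conf, "B");
>>>>>     try {
>>>>>       return tableB.get(gets);   // the multi-get
>>>>>     } finally {
>>>>>       tableB.close();
>>>>>     }
>>>>>   }
>>>>> }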
>>>>> 
>>>>> 
>>>>> -----Original Message-----
>>>>> From: Buttler, David [mailto:buttler1@llnl.gov]
>>>>> Sent: Monday, June 06, 2011 4:30 PM
>>>>> To: user@hbase.apache.org
>>>>> Subject: RE: How to efficiently join HBase tables?
>>>>> 
>>>>> So, you all realize that joins have been talked about in the database
>>>>> community for 40 years? There are two main types of joins:
>>>>> Nested loops
>>>>> Hash table
>>>>> 
>>>>> Mike, in his various emails, seems to be trying to re-imagine how to
>>>>> implement both types of joins in HBase (which seems like a reasonable
>>>>> goal). I am not exactly sure what Eran is going for here, but it seems
>>>>> like Eran is glossing over a piece. If you have two scanners for table
>>>>> A and B, then table B needs to be rescanned for every unique part of
>>>>> the join condition in table A. There are certain ways of improving the
>>>>> efficiency of that: the two most obvious are pushing the selection
>>>>> criteria down to the scans, and scanning all of the same join values
>>>>> from table B at the same time (which requires that table B's key is the
>>>>> join key, or a secondary structure that stores the join values as the
>>>>> primary order).
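>>>>> 
>>>>> To illustrate the first of those, pushing the selection down into the
>>>>> scan might look like this (untested; the family/qualifier names are
>>>>> made up, the values come from Mike's example query below):
>>>>> 
>>>>> import org.apache.hadoop.hbase.client.Scan;
>>>>> import org.apache.hadoop.hbase.filter.CompareFilter;
>>>>> import org.apache.hadoop.hbase.filter.FilterList;
>>>>> import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
>>>>> import org.apache.hadoop.hbase.util.Bytes;
>>>>> 
>>>>> public class PushdownScan {
>>>>>   // Evaluate "c = xxx AND d = yyy" on the region servers, so only
>>>>>   // matching rows of table A come back over the wire.
>>>>>   static Scan scanForTableA() {
>>>>>     FilterList where = new FilterList(FilterList.Operator.MUST_PASS_ALL);
>>>>>     where.addFilter(new SingleColumnValueFilter(
>>>>>         Bytes.toBytes("cf"), Bytes.toBytes("c"),
>>>>>         CompareFilter.CompareOp.EQUAL, Bytes.toBytes("xxx")));
>>>>>     where.addFilter(new SingleColumnValueFilter(
>>>>>         Bytes.toBytes("cf"), Bytes.toBytes("d"),
>>>>>         CompareFilter.CompareOp.EQUAL, Bytes.toBytes("yyy")));
>>>>>     Scan scan = new Scan();
>>>>>     scan.setFilter(where);
>>>>>     return scan;
>>>>>   }
>>>>> }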
>>>>> 
>>>>> Dave
>>>>> 
>>>>> -----Original Message-----
>>>>> From: eran@gigya-inc.com [mailto:eran@gigya-inc.com] On Behalf Of Eran Kutner
>>>>> Sent: Friday, June 03, 2011 12:24 AM
>>>>> To: user@hbase.apache.org
>>>>> Subject: Re: How to efficiently join HBase tables?
>>>>> 
>>>>> Mike, this is more or less what I tried to describe in my initial
>>>>> post, only you explained it much better.
>>>>> The problem is that I want to do all of this in one M/R run, not 3, and
>>>>> without explicit temp tables. If there was only a way to feed both
>>>>> table A and table B into the M/R job then it could be done.
>>>>> 
>>>>> Let's take your query and assumptions, for example.
>>>>> So we configure scanner A to return rows where c=xxx and d=yyy. We then
>>>>> configure scanner B to return rows where e=zzz. Now we feed all those
>>>>> rows to the mapper.
>>>>> For each row the mapper gets, it outputs a new key which is "a|b" and
>>>>> the same value it received; if either one doesn't exist in the row, the
>>>>> mapper doesn't output anything for that row.
>>>>> There is an implicit "temp table" created at this stage by Hadoop.
>>>>> Now the reducer is run. For every key "a|b" generated by the mapper it
>>>>> would get one or more value sets, each one representing a row from the
>>>>> original two tables. For simplicity let's assume we got two rows, one
>>>>> from table A, the other from table B. Now the reducer can combine the
>>>>> two rows and output the combined row. This will work just the same if
>>>>> there were multiple rows from each table with the same "a|b" key; in
>>>>> that case the reducer would have to generate the Cartesian product of
>>>>> all the rows. Outer joins can also be done this way; in an outer join
>>>>> you only get one row in the reducer for a given "a|b" key but still
>>>>> generate an output.
>>>>> 
>>>>> -eran
>>>>> 
>>>>> 
>>>>> 
>>>>> On Fri, Jun 3, 2011 at 00:05, Michael Segel <michael_segel@hotmail.com> wrote:
>>>>> 
>>>>>> 
>>>>>> Not to beat a dead horse, but I thought a bit more about the problem.
>>>>>> If you want to do this all in HBase using a M/R job...
>>>>>> 
>>>>>> Let's define the following:
>>>>>> SELECT *
>>>>>> FROM A, B
>>>>>> WHERE A.a = B.a
>>>>>> AND     A.b = B.b
>>>>>> AND     A.c = xxx
>>>>>> AND     A.d = yyy
>>>>>> AND     B.e = zzz
>>>>>> 
>>>>>> Is the sample query.
>>>>>> 
>>>>>> So our join key is "a|b" because we're matching on columns a and b.
>>>>>> (The pipe is to delimit the columns, assuming the columns themselves
>>>>>> don't contain pipes...)
>>>>>> 
>>>>>> Our filters on A are c and d while e is the filter on B.
>>>>>> 
>>>>>> So we want to do the following:
>>>>>> 
>>>>>> M/R Map job 1 gets the subset from table A along with a set of unique
>>>>>> keys.
>>>>>> M/R Map job 2 gets the subset from table B along with a set of unique
>>>>>> keys.
>>>>>> M/R Map job 3 takes either set of unique keys as the input list and
>>>>>> you split it based on the number of parallel mappers you want to use.
>>>>>> 
>>>>>> You have a couple of options on how you want to proceed.
>>>>>> In each Mapper.map() your input is a unique key.
>>>>>> I guess you could create two scanners, one for tempTableA, and one
>>>>>> for tempTableB.
>>>>>> It looks like you can get the iterator for each result set, and then
>>>>>> for each row in temp table A, you iterate through the result set
>>>>>> from temp table B, writing out the joined set.
>>>>>> 
>>>>>> The only problem is that your result set file isn't in sort order.
>>>>>> So I guess you could take the output from this job and reduce it to
>>>>>> get it into sort order.
>>>>>> 
>>>>>> Option B. Using HDFS files for temp 'tables'.
>>>>>> You can do this... but you would still have to track the unique keys
>>>>>> and also sort both the keys and the files which will require a reduce
>>>>>> job.
>>>>>> 
>>>>>> 
>>>>>> Now this is just my opinion, but if I use HBase, I don't have to
>>>>>> worry about using a reducer except to order the final output set.
>>>>>> So I can save the time it takes to do the reduce step. So I have to
>>>>>> ask... how much time is spent by HBase in splitting and compacting the
>>>>>> temp tables?
>>>>>> Also, can't you pre-split the temp tables before you use them?
>>>>>> 
>>>>>> Or am I still missing something?
>>>>>> 
>>>>>> Note: In this example, you'd have to write an input format that
>>>>>> takes a Java List object (or something similar) as your input, and
>>>>>> then you can split it to get it to run in parallel.
>>>>>> Or you could just write this on the client, split the list up, and
>>>>>> run the join in parallel threads on the client node. Or a single
>>>>>> thread, which would mean that it would run and output in sort order.
>>>>>> 
>>>>>> HTH
>>>>>> 
>>>>>> -Mike
>>>>>> 
>>>>>>> Date: Wed, 1 Jun 2011 07:47:30 -0700
>>>>>>> Subject: Re: How to efficiently join HBase tables?
>>>>>>> From: jason.rutherglen@gmail.com
>>>>>>> To: user@hbase.apache.org
>>>>>>> 
>>>>>>>> you somehow need to flush all in-memory data *and* perform a
>>>>>>>> major compaction
>>>>>>> 
>>>>>>> This makes sense. Without compaction the linear HDFS scan isn't
>>>>>>> possible. I suppose one could compact 'offline' in a different
>>>>>>> Map Reduce job. However that would have its own issues.
>>>>>>> 
>>>>>>>> The files do have a flag if they were made by a major
>>>>>>>> compaction, so you scan only those and ignore the newer ones -
>>>>>>>> but then you are trailing
>>>>>>> 
>>>>>>> This could be ok in many cases. The key would be to create a
>>>>>>> sync'd cut off point enabling a frozen point-in-time 'view' of the
>>>>>>> data. I'm not sure how that would be implemented.
>>>>>>> 
>>>>>>> On Wed, Jun 1, 2011 at 6:54 AM, Lars George <lars.george@gmail.com> wrote:
>>>>>>>> Hi Jason,
>>>>>>>> 
>>>>>>>> This was discussed in the past, using the HFileInputFormat. The
>>>>>>>> issue is that you somehow need to flush all in-memory data *and*
>>>>>>>> perform a major compaction - or else you would need all the logic
>>>>>>>> of the ColumnTracker in the HFIF. Since that needs to scan all
>>>>>>>> storage files in parallel to achieve its job, the MR task would
>>>>>>>> not really be able to use the same approach.
>>>>>>>> 
>>>>>>>> Running a major compaction creates a lot of churn, so it is
>>>>>>>> questionable what the outcome is. The files do have a flag if they
>>>>>>>> were made by a major compaction, so you scan only those and ignore
>>>>>>>> the newer ones - but then you are trailing, and you still do not
>>>>>>>> handle delete markers/updates in newer files. No easy feat.
>>>>>>>> 
>>>>>>>> Lars
>>>>>>>> 
>>>>>>>> On Wed, Jun 1, 2011 at 2:41 AM, Jason Rutherglen <jason.rutherglen@gmail.com> wrote:
>>>>>>>>>> I'd imagine that join operations do not require realtime-ness,
>>>>>>>>>> and so faster batch jobs using Hive -> frozen HBase files in
>>>>>>>>>> HDFS could be the optimal way to go?
>>>>>>>>> 
>>>>>>>>> In addition to lessening the load on the perhaps live RegionServer.
>>>>>>>>> There's no Jira for this, I'm tempted to open one.
>>>>>>>>> 
>>>>>>>>> On Tue, May 31, 2011 at 5:18 PM, Jason Rutherglen <jason.rutherglen@gmail.com> wrote:
>>>>>>>>>>> The Hive-HBase integration allows you to create Hive tables
>>>>>>>>>>> that are backed by HBase
>>>>>>>>>> 
>>>>>>>>>> In addition, HBase can be made to go faster for MapReduce jobs,
>>>>>>>>>> if the HFiles could be used directly in HDFS, rather than
>>>>>>>>>> proxying through the RegionServer.
>>>>>>>>>> 
>>>>>>>>>> I'd imagine that join operations do not require realtime-ness,
>>>>>>>>>> and so faster batch jobs using Hive -> frozen HBase files in
>>>>>>>>>> HDFS could be the optimal way to go?
>>>>>>>>>> 
>>>>>>>>>> On Tue, May 31, 2011 at 1:41 PM, Patrick Angeles <patrick@cloudera.com> wrote:
>>>>>>>>>>> On Tue, May 31, 2011 at 3:19 PM, Eran Kutner <eran@gigya.com> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> For my need I don't really need the general case, but even if I
>>>>>>>>>>>> did I think it can probably be done simpler.
>>>>>>>>>>>> The main problem is getting the data from both tables into the
>>>>>>>>>>>> same MR job, without resorting to lookups. So without the
>>>>>>>>>>>> theoretical MultiTableInputFormat, I could just copy all the
>>>>>>>>>>>> data from both tables into a temp table, just appending the
>>>>>>>>>>>> source table name to the row keys to make sure there are no
>>>>>>>>>>>> conflicts. When all the data from both tables is in the same
>>>>>>>>>>>> temp table, run a MR job. For each row the mapper should emit a
>>>>>>>>>>>> key which is composed of all the values of the join fields in
>>>>>>>>>>>> that row (the value can be emitted as is). This will cause all
>>>>>>>>>>>> the rows from both tables with the same join field values to
>>>>>>>>>>>> arrive at the reducer together. The reducer could then iterate
>>>>>>>>>>>> over them and produce the Cartesian product as needed.
>>>>>>>>>>>> 
>>>>>>>>>>>> I still don't like having to copy all the data into a temp
>>>>>>>>>>>> table just because I can't feed two tables into the MR job.
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> Loading the smaller table in memory is called a map join, versus
>>>>>>>>>>> a reduce-side join (a.k.a. common join). One reason to prefer a
>>>>>>>>>>> map join is you avoid the shuffle phase, which potentially
>>>>>>>>>>> involves several trips to disk for the intermediate records due
>>>>>>>>>>> to spills, and also once through the network to get each
>>>>>>>>>>> intermediate KV pair to the right reducer. With a map join,
>>>>>>>>>>> everything is local, except for the part where you load the
>>>>>>>>>>> small table.
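>>>>>>>>>>> 
>>>>>>>>>>> A bare-bones map join against HBase might look like this
>>>>>>>>>>> (untested sketch; table, family and qualifier names are all
>>>>>>>>>>> made up, and "B" is assumed small enough to fit in memory):
>>>>>>>>>>> 
>>>>>>>>>>> import java.io.IOException;
>>>>>>>>>>> import java.util.HashMap;
>>>>>>>>>>> import java.util.Map;
>>>>>>>>>>> import org.apache.hadoop.hbase.HBaseConfiguration;
>>>>>>>>>>> import org.apache.hadoop.hbase.client.HTable;
>>>>>>>>>>> import org.apache.hadoop.hbase.client.Result;
>>>>>>>>>>> import org.apache.hadoop.hbase.client.ResultScanner;
>>>>>>>>>>> import org.apache.hadoop.hbase.client.Scan;
>>>>>>>>>>> import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
>>>>>>>>>>> import org.apache.hadoop.hbase.mapreduce.TableMapper;
>>>>>>>>>>> import org.apache.hadoop.hbase.util.Bytes;
>>>>>>>>>>> import org.apache.hadoop.io.Text;
>>>>>>>>>>> 
>>>>>>>>>>> public class MapJoinMapper extends TableMapper<Text, Text> {
>>>>>>>>>>>   private final Map<String, String> small =
>>>>>>>>>>>       new HashMap<String, String>();
>>>>>>>>>>> 
>>>>>>>>>>>   protected void setup(Context ctx) throws IOException {
>>>>>>>>>>>     // Load the small table ("B", keyed by the join value)
>>>>>>>>>>>     // fully into memory, once per map task.
>>>>>>>>>>>     HTable b = new HTable(HBaseConfiguration.create(), "B");
>>>>>>>>>>>     ResultScanner rs = b.getScanner(new Scan());
>>>>>>>>>>>     for (Result r : rs) {
>>>>>>>>>>>       small.put(Bytes.toString(r.getRow()),
>>>>>>>>>>>           Bytes.toString(r.getValue(Bytes.toBytes("cf"),
>>>>>>>>>>>                                     Bytes.toBytes("val"))));
>>>>>>>>>>>     }
>>>>>>>>>>>     rs.close();
>>>>>>>>>>>     b.close();
>>>>>>>>>>>   }
>>>>>>>>>>> 
>>>>>>>>>>>   protected void map(ImmutableBytesWritable key, Result aRow,
>>>>>>>>>>>       Context ctx) throws IOException, InterruptedException {
>>>>>>>>>>>     // Join locally against the in-memory copy: no shuffle,
>>>>>>>>>>>     // no reducer.
>>>>>>>>>>>     String joinVal = Bytes.toString(aRow.getValue(
>>>>>>>>>>>         Bytes.toBytes("cf"), Bytes.toBytes("joinCol")));
>>>>>>>>>>>     String match = small.get(joinVal);
>>>>>>>>>>>     if (match != null) {
>>>>>>>>>>>       ctx.write(new Text(Bytes.toString(aRow.getRow())),
>>>>>>>>>>>                 new Text(match));
>>>>>>>>>>>     }
>>>>>>>>>>>   }
>>>>>>>>>>> }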
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> As Jason Rutherglen mentioned above, Hive can do joins. I don't
>>>>>>>>>>>> know if it can do them for HBase and it will not suit my needs,
>>>>>>>>>>>> but it would be interesting to know how it is doing them, if
>>>>>>>>>>>> anyone knows.
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> The Hive-HBase integration allows you to create Hive tables that
>>>>>>>>>>> are backed by HBase. You can do joins on those tables (and also
>>>>>>>>>>> with standard Hive tables). It might be worth trying out in your
>>>>>>>>>>> case as it lets you easily see the load characteristics and the
>>>>>>>>>>> job runtime without much coding investment.
>>>>>>>>>>> 
>>>>>>>>>>> There are probably some specific optimizations that can be
>>>>>>>>>>> applied to your situation, but it's hard to say without knowing
>>>>>>>>>>> your use-case.
>>>>>>>>>>> 
>>>>>>>>>>> Regards,
>>>>>>>>>>> 
>>>>>>>>>>> - Patrick
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>>> -eran
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> On Tue, May 31, 2011 at 22:02, Ted Dunning <tdunning@maprtech.com> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> The Cartesian product often makes an honest-to-god join not
>>>>>>>>>>>>> such a good idea on large data. The common alternative is
>>>>>>>>>>>>> co-group, which is basically like doing the hard work of the
>>>>>>>>>>>>> join but involves stopping just before emitting the Cartesian
>>>>>>>>>>>>> product. This allows you to inject whatever cleverness you
>>>>>>>>>>>>> need at this point.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Common kinds of cleverness include down-sampling of
>>>>>>>>>>>>> problematically large sets of candidates.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Tue, May 31, 2011 at 11:56 AM, Michael Segel <michael_segel@hotmail.com> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> So the underlying problem that the OP was trying to solve
>>>>>>>>>>>>>> was how to join two tables from HBase.
>>>>>>>>>>>>>> Unfortunately I goofed.
>>>>>>>>>>>>>> I gave a quick and dirty solution that is a bit incomplete.
>>>>>>>>>>>>>> The row key in the temp table has to be unique and I forgot
>>>>>>>>>>>>>> about the Cartesian product. So my solution wouldn't work in
>>>>>>>>>>>>>> the general case.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>> 
>>>> 
>>> 
>> 
