hive-dev mailing list archives

From Yongqiang He <>
Subject Re: Putting the big table rightmost in the join
Date Fri, 19 Feb 2010 20:40:05 GMT
Yes. I agree. 
Pushing work that developers should do onto users is never
user-friendly. :)
For star schemas, people can easily find out which table is the small table.
Joining with more than one big table is always risky.

But until we have a cost-based optimizer, it seems we have no other choice.

On 2/19/10 11:35 AM, "Edward Capriolo" <> wrote:

> On Fri, Feb 19, 2010 at 2:25 PM, Edward Capriolo <>
> wrote:
>> On Fri, Feb 19, 2010 at 12:35 AM, Yongqiang He
>> <> wrote:
>>> Hi Edward,
>>> You can do it with the STREAMTABLE hint. Hive will treat the table named in
>>> that hint as the rightmost one.
>>> -yongqiang
>>> On 2/18/10 3:21 PM, "Edward Capriolo" <> wrote:
>>>> I have worked through this issue.
>>>> * When doing Join, please put the table with big number of rows
>>>> containing the same join key to
>>>> the rightmost in the JOIN clause. Otherwise we may see OutOfMemory errors.
>>>> This advice does work, but should we open up a jira to create a simple
>>>> optimizer that does this?
>>>> Edward
>> I do not understand the hint. A user can rewrite the query, can't they?
>> select a join b
>> select b join a
>> What I am asking is: should we add an optimizer that applies heuristics
>> to the tables and automatically streams the smaller/larger?
> The reason I am mentioning this is I am training hive users right now.
> You can imagine that the first three-table join someone did caused an
> OOM. I explained to them roughly how a hive join works and how you
> should move the largest table to one side. They understood but
> replied, "Sounds like something an optimizer could handle."
> Even when joining two tables, it is a pain to ask someone to find which
> table is larger. Imagine joining 10 or so. There is also user perception:
> imagine your first join throwing an OOM.
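
The heuristic discussed in this thread could be sketched roughly as follows. This is only an illustration, not Hive code: the function name `stream_largest`, the table names, and the row counts are all hypothetical, and it uses total row count as a stand-in for "number of rows sharing a join key", which is the quantity the original advice is really about. It returns both a reordered table list (largest last) and the equivalent STREAMTABLE hint.

```python
def stream_largest(table_rows):
    """Return (join_order, hint) with the largest table placed rightmost.

    table_rows maps table name -> estimated row count. This input is an
    assumption: a real optimizer would need table statistics from somewhere.
    """
    if not table_rows:
        raise ValueError("need at least one table")
    # Sort ascending by estimated size (name breaks ties) so the largest
    # table ends up last, i.e. rightmost in the JOIN clause.
    order = sorted(table_rows, key=lambda name: (table_rows[name], name))
    largest = order[-1]
    # Alternative to reordering: keep the query as written and name the
    # largest table in a STREAMTABLE hint instead.
    hint = "/*+ STREAMTABLE(%s) */" % largest
    return order, hint


# Hypothetical star-schema sizes: one big fact table, two small dimensions.
order, hint = stream_largest(
    {"clicks": 500_000_000, "users": 2_000_000, "countries": 250}
)
print(order)
print(hint)
```

Either result addresses the same problem: the user can write the joins in the returned order, or keep the original order and add the hint after SELECT.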
