hadoop-hive-dev mailing list archives

From Edward Capriolo <edlinuxg...@gmail.com>
Subject Re: Putting the big table rightmost in the join
Date Fri, 19 Feb 2010 19:35:47 GMT
On Fri, Feb 19, 2010 at 2:25 PM, Edward Capriolo <edlinuxguru@gmail.com> wrote:
> On Fri, Feb 19, 2010 at 12:35 AM, Yongqiang He
> <heyongqiang@software.ict.ac.cn> wrote:
>> Hi Edward,
>> You can do it with streamtable hint. Hive will put the table in that hint in
>> the rightmost.
>> -yongqiang
>> On 2/18/10 3:21 PM, "Edward Capriolo" <edlinuxguru@gmail.com> wrote:
>>> I have worked through this issue.
>>> * When doing Join, please put the table with big number of rows
>>> containing the same join key to
>>> the rightmost in the JOIN clause. Otherwise we may see OutOfMemory errors.
>>> This advice does work, but should we open up a jira to create a simple
>>> optimizer that does this?
>>> Edward
> I do not understand the hint. A user can rewrite the query, can't they?
> select a join b
> select b join a
> What I am asking is: should we add an optimizer that uses heuristics
> on the tables and automatically streams the smaller/larger?

The reason I am mentioning this is that I am training Hive users right now.
You can imagine that the first three-table join someone did caused an
OOM. I explained to them roughly how a Hive join works and how you
should move the largest table to one side. They understood but
replied, "Sounds like something an optimizer could handle."

Even when joining two tables, it is a pain to ask someone to find which
table is larger. Imagine joining 10 or so. There is also user perception:
imagine your first join throwing an OOM.
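
To make the optimizer idea concrete, here is a minimal sketch in Python,
not Hive code. It assumes per-table row counts are available from
somewhere (the table names and counts below are made up), and it uses
total row count as a crude stand-in for "many rows per join key". The
function names are hypothetical; the only real Hive feature referenced is
the STREAMTABLE hint mentioned above.

```python
def order_for_join(row_counts):
    """Return table names sorted by estimated row count, largest last.

    row_counts maps table name -> estimated number of rows (assumed input).
    The largest table ends up rightmost, matching the manual advice.
    """
    return sorted(row_counts, key=row_counts.get)


def streamtable_hint(row_counts):
    """Build a STREAMTABLE hint naming the table with the most rows."""
    largest = max(row_counts, key=row_counts.get)
    return "/*+ STREAMTABLE(%s) */" % largest


# Hypothetical tables and row counts, for illustration only.
row_counts = {"page_views": 500_000_000, "users": 2_000_000, "countries": 250}

print(order_for_join(row_counts))
print(streamtable_hint(row_counts))
```

Either result would do: reorder the JOIN clause so the largest table comes
last, or leave the query as written and add the hint. A real optimizer
would need better statistics than a raw row count, but even this much
would spare a new user from working out table sizes by hand.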
