hadoop-hive-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Edward Capriolo <edlinuxg...@gmail.com>
Subject Re: Putting the big table rightmost in the join
Date Fri, 19 Feb 2010 20:09:23 GMT
On Fri, Feb 19, 2010 at 2:42 PM, Amr Awadallah <aaa@cloudera.com> wrote:
> Ed,
>
>  You apply the hint like this:
>
> SELECT /*+ STREAMTABLE(A) */ ......
>
> that will force table A to be streamed through whether you mention it as
> rightmost or not.
>
> What you are asking for is also reasonable, you are asking for a simple
> cost-based optimizer that looks at the sizes of the tables in bytes then
> automatically apply the STREAMTABLE hint for that table (the only problem
> here is a large table in terms of bytes doesn't mean it is large in terms of
> rows, that is why table stat collection would be useful).
>
> Yongqiang, besides the STREAMTABLE  and MAPJOIN hints, what other hints are
> allowed?
>
> -- amr
>
> On Fri, Feb 19, 2010 at 11:25 AM, Edward Capriolo <edlinuxguru@gmail.com>wrote:
>
>> On Fri, Feb 19, 2010 at 12:35 AM, Yongqiang He
>> <heyongqiang@software.ict.ac.cn> wrote:
>> > Hi Edward,
>> > You can do it with streamtable hint. Hive will put the table in that hint
>> in
>> > the rightmost.
>> >
>> > -yongqiang
>> > On 2/18/10 3:21 PM, "Edward Capriolo" <edlinuxguru@gmail.com> wrote:
>> >
>> >> I have worked through this issue.
>> >>
>> >> * When doing Join, please put the table with big number of rows
>> >> containing the same join key to
>> >> the rightmost in the JOIN clause. Otherwise we may see OutOfMemory
>> errors.
>> >>
>> >> This advice does work, but should we open up a jira to create a simple
>> >> optimizer that does this?
>> >>
>> >> Edward
>> >>
>> >>
>> >
>> >
>> >
>>
>> I do not understand the hint. A user can re-write the query can't they?
>>
>> select a join b
>> select b join a
>>
>> What I am asking, should we add an optimizer that uses does heuristics
>> on the tables and automatically streams the smaller/larger?
>>
>

I figured looking at table size may be misleading, and we want row
count. I was thinking we can do something similar to what the
TotalOrderPartitioner does, we can sample some rows from some random
blocks. With that information on average row size and table size we
should be able to computer which table has more rows. Not sure how
much overhead this process would add to each query but we could make
that a Conf switch.

Mime
View raw message