hadoop-hive-dev mailing list archives

From Amr Awadallah <...@cloudera.com>
Subject Re: Putting the big table rightmost in the join
Date Fri, 19 Feb 2010 19:42:40 GMT
Ed,

  You apply the hint like this:

SELECT /*+ STREAMTABLE(A) */ ......

That forces table A to be streamed whether or not you list it rightmost
in the join.

What you are asking for is also reasonable: a simple cost-based optimizer
that looks at the sizes of the tables in bytes and then automatically
applies the STREAMTABLE hint to the largest one. The only problem is that a
table that is large in bytes is not necessarily large in rows, which is why
table stat collection would be useful.
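
To make the idea concrete, here is a rough sketch of such a size-based
heuristic, written in Python purely for illustration. It is not Hive code:
the function names pick_stream_table and add_streamtable_hint, the table
aliases, and the byte sizes are all made up, and a real optimizer would work
on the query plan rather than on the query string.

```python
def pick_stream_table(sizes_in_bytes):
    """Return the alias of the largest table by byte size.

    Hypothetical heuristic: bytes are only a proxy for row count.
    """
    if not sizes_in_bytes:
        raise ValueError("need at least one table")
    return max(sizes_in_bytes, key=sizes_in_bytes.get)


def add_streamtable_hint(query, alias):
    """Insert a STREAMTABLE hint right after the leading SELECT keyword."""
    head, sep, rest = query.strip().partition(" ")
    if head.upper() != "SELECT" or not sep:
        raise ValueError("expected a query starting with SELECT")
    return "SELECT /*+ STREAMTABLE(%s) */ %s" % (alias, rest)


# Made-up sizes: table a is about 50 GB, table b about 200 MB.
sizes = {"a": 50 * 1024**3, "b": 200 * 1024**2}
query = "SELECT a.val, b.val FROM b JOIN a ON (a.key = b.key)"

hinted = add_streamtable_hint(query, pick_stream_table(sizes))
print(hinted)
```

Here the larger table a gets the hint even though it is not written
rightmost in the JOIN clause.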

Yongqiang, besides the STREAMTABLE and MAPJOIN hints, what other hints are
allowed?

-- amr

On Fri, Feb 19, 2010 at 11:25 AM, Edward Capriolo <edlinuxguru@gmail.com> wrote:

> On Fri, Feb 19, 2010 at 12:35 AM, Yongqiang He
> <heyongqiang@software.ict.ac.cn> wrote:
> > Hi Edward,
> > You can do it with streamtable hint. Hive will put the table in that hint
> > in the rightmost.
> >
> > -yongqiang
> > On 2/18/10 3:21 PM, "Edward Capriolo" <edlinuxguru@gmail.com> wrote:
> >
> >> I have worked through this issue.
> >>
> >> * When doing Join, please put the table with big number of rows
> >> containing the same join key to
> >> the rightmost in the JOIN clause. Otherwise we may see OutOfMemory
> >> errors.
> >>
> >> This advice does work, but should we open up a jira to create a simple
> >> optimizer that does this?
> >>
> >> Edward
> >>
> >>
> >
> >
> >
>
> I do not understand the hint. A user can re-write the query can't they?
>
> select a join b
> select b join a
>
> What I am asking is, should we add an optimizer that uses heuristics
> on the tables and automatically streams the smaller/larger?
>
