hive-dev mailing list archives

From "Gopal V (JIRA)" <>
Subject [jira] [Commented] (HIVE-5775) Introduce Cost Based Optimizer to Hive
Date Mon, 23 Jun 2014 17:17:25 GMT


Gopal V commented on HIVE-5775:

[~xuefuz]: The CBO model rewrites queries using cardinality statistics.

The tuple count and distinct value count should not depend on which physical layer the
query runs on - having the CBO split up/reorder a 3-way map-join into 2 phases (or vertices)
should generate identical plans in both.

MR would run 2 Map-only phases with their own local tasks and hashtable uploads, Tez would
run 2 vertices with their own broadcast tasks.
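
As a rough illustration of that mapping, the sketch below takes one two-phase logical plan
and describes the physical unit each phase becomes on either engine. Everything here is
hypothetical (the `Phase` class, `physical_units`, the table names); none of it is Hive or
Tez API, it only mirrors the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phase:
    """One logical map-join phase: a streamed input joined against small hashed inputs."""
    streamed_input: str
    hashed_inputs: tuple


# Hypothetical result of the CBO splitting a 3-way map-join into two phases.
LOGICAL_PLAN = (
    Phase("big_table", ("small_a",)),
    Phase("phase_1_output", ("small_b",)),
)


def physical_units(plan, engine):
    """Describe the physical unit each logical phase becomes on the given engine."""
    if engine == "mr":
        template = "Map-only job {n}: local task builds and uploads hashtable for {small}"
    elif engine == "tez":
        template = "Vertex {n}: broadcast task feeds hashtable for {small}"
    else:
        raise ValueError(f"unknown engine: {engine}")
    return [
        template.format(n=i, small=", ".join(phase.hashed_inputs))
        for i, phase in enumerate(plan, start=1)
    ]


# The logical plan is the same object for both engines; only the labels differ.
for engine in ("mr", "tez"):
    for line in physical_units(LOGICAL_PLAN, engine):
        print(engine, "->", line)
```

The logical plan is built once and never consults the engine; only the last translation
step does.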

Tez can reduce runtimes further by removing the intermediate I/O cost and co-scheduling the
second vertex in the same container as the first - but the CBO does not assume this, as it
is not a strong guarantee in a busy cluster.

The Tez runtime model is faster, but the logical cost does not change as the number of rows
read off disk, written to disk and distinct keys remain the same.
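
A minimal sketch of what "the logical cost does not change" means, assuming a toy cost
formula (rows read + rows written + distinct keys). The formula, the `PhaseStats` and
`choose_plan` names and all the numbers are made up for illustration and are not the cost
model from the patch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhaseStats:
    """Cardinality statistics for one phase of a candidate plan (illustrative only)."""
    rows_read: int
    rows_written: int
    distinct_keys: int


def logical_cost(phases):
    """Toy cost: a plain sum of the cardinalities named above (an assumed formula)."""
    return sum(p.rows_read + p.rows_written + p.distinct_keys for p in phases)


def choose_plan(candidates, engine):
    """Pick the cheapest candidate; the engine is accepted but deliberately unused."""
    del engine  # the logical cost has no engine-dependent term
    return min(candidates, key=lambda name: logical_cost(candidates[name]))


# Made-up statistics for two ways of running the same 3-way map-join.
CANDIDATES = {
    "single_phase": [PhaseStats(1_200_000, 50_000, 200_000)],
    "two_phase": [
        PhaseStats(1_100_000, 100_000, 100_000),
        PhaseStats(200_000, 50_000, 100_000),
    ],
}

for engine in ("mr", "tez"):
    print(engine, "->", choose_plan(CANDIDATES, engine))
```

Because no term in the cost depends on the engine, the same candidate wins for MR and Tez;
any runtime gain Tez gets from the chosen plan sits outside this calculation.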

In fact, as it exists today, because the cost model applies equally to both Tez and MR, it
ignores a lot of Tez's opportunistic runtime optimizations like container reuse - e.g. it
assumes that "Each vertex in Tez is a different process".

It is up to the Tez DAG planner to attend to such runtime optimization details.

> Introduce Cost Based Optimizer to Hive
> --------------------------------------
>                 Key: HIVE-5775
>                 URL:
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Laljo John Pullokkaran
>            Assignee: Laljo John Pullokkaran
>         Attachments: CBO-2.pdf, HIVE-5775.1.patch

