hive-dev mailing list archives

From "Phabricator (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-4002) Fetch task aggregation for simple group by query
Date Tue, 27 Aug 2013 15:00:58 GMT

    [ https://issues.apache.org/jira/browse/HIVE-4002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13751330#comment-13751330 ]

Phabricator commented on HIVE-4002:
-----------------------------------

yhuai has commented on the revision "HIVE-4002 [jira] Fetch task aggregation for simple group
by query".

INLINE COMMENTS
  ql/src/java/org/apache/hadoop/hive/ql/exec/FetchOperator.java:493 I think flush is
only needed for blocking operators. With this optimization, the operator tree in the fetch
task seems to have only a single blocking operator, which is GBY. Since GBY is the first operator
in the fetch task (the operator shown in flush() in this class), I do not think we need to
call flush on all operators in the operator tree. Is it possible that GBY is not the first operator?
  ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java:6985 There are other places
where we use colInfo.getInternalName(). I think it would be better to change those places as well
if we want to use field.
  ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java:582 Let's say we have a chain of
operators OP1-OP2-OP3. With this change, when flush in OP1 is called, it will call its own flushOp
and then call flushOp in OP2. It seems that neither flush nor flushOp in OP3 will ever be called. Also, when
I introduced flush with the Correlation Optimizer, this method was not designed to propagate the
signal to an operator's children.
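
To make the OP1-OP2-OP3 concern concrete, here is a minimal sketch of the two propagation styles. It is only an illustration: the Op class, flushOneLevel, and flushRecursive are hypothetical names invented for this example and are not Hive's actual Operator API. It assumes the behavior described above, where flush runs the operator's own flushOp and then calls flushOp (not flush) on its direct children.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for an operator chain; not Hive's actual Operator class.
class FlushChainSketch {

    static class Op {
        final String name;
        final List<Op> children = new ArrayList<>();
        final List<String> log; // records which operators ran flushOp

        Op(String name, List<String> log) {
            this.name = name;
            this.log = log;
        }

        // Operator-local flush work (a blocking operator would emit buffered rows here).
        void flushOp() {
            log.add(name);
        }

        // Variant described in the review comment: flush this operator, then call
        // flushOp (not flush) on direct children, so the signal stops after one level.
        void flushOneLevel() {
            flushOp();
            for (Op child : children) {
                child.flushOp();
            }
        }

        // Recursive variant: every descendant in the chain is reached.
        void flushRecursive() {
            flushOp();
            for (Op child : children) {
                child.flushRecursive();
            }
        }
    }

    // Builds OP1-OP2-OP3, flushes from OP1, and returns the operators that were flushed.
    static List<String> run(boolean recursive) {
        List<String> log = new ArrayList<>();
        Op op1 = new Op("OP1", log);
        Op op2 = new Op("OP2", log);
        Op op3 = new Op("OP3", log);
        op1.children.add(op2);
        op2.children.add(op3);
        if (recursive) {
            op1.flushRecursive();
        } else {
            op1.flushOneLevel();
        }
        return log;
    }

    public static void main(String[] args) {
        System.out.println(run(false)); // [OP1, OP2]
        System.out.println(run(true));  // [OP1, OP2, OP3]
    }
}
```

In this sketch the one-level variant never reaches OP3, which matches the concern raised in the comment. Whether recursion is the right fix depends on how flush is meant to be used elsewhere.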

REVISION DETAIL
  https://reviews.facebook.net/D8739

To: JIRA, navis
Cc: yhuai

                
> Fetch task aggregation for simple group by query
> ------------------------------------------------
>
>                 Key: HIVE-4002
>                 URL: https://issues.apache.org/jira/browse/HIVE-4002
>             Project: Hive
>          Issue Type: Improvement
>          Components: Query Processor
>            Reporter: Navis
>            Assignee: Navis
>            Priority: Minor
>         Attachments: HIVE-4002.D8739.1.patch, HIVE-4002.D8739.2.patch, HIVE-4002.D8739.3.patch, HIVE-4002.D8739.4.patch
>
>
> Aggregation queries with no group-by clause (for example, select count(*) from src) execute
> the final aggregation in a single reduce task. But that work is too small even for a single reducer,
> because most UDAFs generate just a single row for map aggregation. If the final fetch task can aggregate
> the outputs from the map tasks, shuffling time can be removed.
> This optimization transforms an operator tree like
> TS-FIL-SEL-GBY1-RS-GBY2-SEL-FS + FETCH-TASK
> into
> TS-FIL-SEL-GBY1-FS + FETCH-TASK(GBY2-SEL-LS)
> With the patch, the time taken for the auto_join_filters.q test was reduced to 6 min (from 10 min before).
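
The idea in the description can be sketched for the count(*) case as follows. This is a simplified illustration under stated assumptions, not the patch's implementation: the class and method names are hypothetical, partialCount plays the role of the map-side GBY1, and mergeCounts plays the role of GBY2 running inside the fetch task on the one-row outputs of the map tasks.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of fetch-side final aggregation; these names are not part of Hive.
class FetchAggregationSketch {

    // Map-side partial aggregation (the GBY1 role): each map task emits one row.
    static long partialCount(List<String> split) {
        return split.size();
    }

    // Final aggregation (the GBY2 role): merges the partial rows, here inside
    // the fetch step instead of a separate reduce task.
    static long mergeCounts(List<Long> partials) {
        long total = 0;
        for (long partial : partials) {
            total += partial;
        }
        return total;
    }

    // Simulates select count(*) over several input splits with no shuffle stage.
    static long countAll(List<List<String>> splits) {
        List<Long> partials = new ArrayList<>();
        for (List<String> split : splits) {
            partials.add(partialCount(split));
        }
        return mergeCounts(partials);
    }

    public static void main(String[] args) {
        List<List<String>> splits = List.of(
                List.of("a", "b"),
                List.of("c"),
                List.<String>of());
        System.out.println(countAll(splits)); // 3
    }
}
```

Because each map task contributes only one partial row, the merge step is cheap enough to run where the results are fetched, which is the motivation given in the description.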

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
