hive-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Phabricator (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-3562) Some limit can be pushed down to map stage
Date Sat, 24 Aug 2013 03:57:52 GMT

    [ https://issues.apache.org/jira/browse/HIVE-3562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13749290#comment-13749290
] 

Phabricator commented on HIVE-3562:
-----------------------------------

navis has commented on the revision "HIVE-3562 [jira] Some limit can be pushed down to map
stage".

  Thanks for a review.

INLINE COMMENTS
  common/src/java/org/apache/hadoop/hive/conf/HiveConf.java:528 I thought it's a convention
of hive to make configurations for enable/disable and threshold each, which was requested
several times till now for me for other issues. I'll address that.
  ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java:1187 Legend : A(a) -->
key A, value a, row A(a)

  If each mapper takes rows lite this
  MAP1(RS) : 40(a)-10(b)-30(c)-10(d)-70(e)-80(f)
  MAP2(RS) : 90(g)-80(h)-60(i)-40(j)-30(k)-20(l)
  MAP3(RS) : 40(m)-50(n)-30(o)-30(p)-60(q)-70(r)

  REDUCER : 10(b,d)-20(l)-30(c,k,o)-40(a,j,m)-50(n)-60(i,q)-70(e,r)-80(f,h)-90(g)
  REDUCER(LIMIT 3) : 10(b,d)-20(l)-30(c,k,o)

  with this patch

  MAP1 : 40(a)-10(b)-30(c)-10(d)
  MAP2 : 40(j)-30(k)-20(l)
  MAP3 : 40(m)-50(n)-30(o)-30(p)

  REDUCER : 10(b,d)-20(l)-30(c,k,o)-40(a,j,m)-50(n)
  REDUCER(LIMIT 3) : 10(b,d)-20(l)-30(c,k,o)
  ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:154 I thought limit 0
is not realistic and can be used for some kind of testing. I'll address that case but making
Hash seemed a little much.
  ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:157 ReduceSinkHash makes
arrays for key/value/hashes. It's for compensating those (not exact but I like the number).
  ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:325 I think there was
a test for RS operator which has no collector. I'll check that.
  ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:340 Right. I've missed
that.
  ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:343 It's possible to
add key/value to hash which is flushed right before. But I just forwarded it.
  ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:346 Right.
  ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:434-436 It's not serializable
class. But I'll add transient mod.
  ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:471 I'll leave log message
in the processOp() method.
  ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:477 Return value can
be false only if it's for GBY. False means the key exists already in RS hash.
  For case A(a)->A(b)->B(c) with limit 5, the first A(a) will be stored into hash. For
the second A(b), value b should be stored, which means values should be byte[][][] using additional
array for each index consuming more memory. But I suspected that case could be a common case
with map aggregation. So I just decided to forward A(b) to output collector.
  ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:495 It does not remove
anything. This is called after index is removed from RS hash.
  ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:498 Good.
  ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:535 Main difference of
HashForOrder from HashForGroupBy is that it should store duplicated keys. As described in
comments, A->B->B->C limit 3 should store A,B,B not A,B,C. I believe this behavior
can be possibly implemented with simple TreeSet but thought this is way simpler.
  ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:566 I prefer Map rather
than Set but I'll change that.
  ql/src/test/queries/clientpositive/limit_pushdown.q:14 ok
  ql/src/test/queries/clientpositive/limit_pushdown.q:43 ok.
  ql/src/java/org/apache/hadoop/hive/ql/optimizer/LimitPushdownOptimizer.java:46 This is not
exact description. RS for limit (not last RS). I'll change this, too.
  ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:253-264 Looks promising.
Need more investigation on implementation of distinct.

REVISION DETAIL
  https://reviews.facebook.net/D5967

BRANCH
  HIVE-3562

ARCANIST PROJECT
  hive

To: JIRA, tarball, ashutoshc, navis
Cc: njain

                
> Some limit can be pushed down to map stage
> ------------------------------------------
>
>                 Key: HIVE-3562
>                 URL: https://issues.apache.org/jira/browse/HIVE-3562
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Navis
>            Assignee: Navis
>            Priority: Trivial
>         Attachments: HIVE-3562.D5967.1.patch, HIVE-3562.D5967.2.patch, HIVE-3562.D5967.3.patch,
HIVE-3562.D5967.4.patch, HIVE-3562.D5967.5.patch, HIVE-3562.D5967.6.patch
>
>
> Queries with limit clause (with reasonable number), for example
> {noformat}
> select * from src order by key limit 10;
> {noformat}
> makes operator tree, 
> TS-SEL-RS-EXT-LIMIT-FS
> But LIMIT can be partially calculated in RS, reducing size of shuffling.
> TS-SEL-RS(TOP-N)-EXT-LIMIT-FS

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Mime
View raw message