hive-dev mailing list archives

From "Phabricator (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-3562) Some limit can be pushed down to map stage
Date Thu, 22 Aug 2013 16:20:54 GMT

    [ https://issues.apache.org/jira/browse/HIVE-3562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13747626#comment-13747626 ]

Phabricator commented on HIVE-3562:
-----------------------------------

ashutoshc has requested changes to the revision "HIVE-3562 [jira] Some limit can be pushed
down to map stage".

INLINE COMMENTS
  common/src/java/org/apache/hadoop/hive/conf/HiveConf.java:528 I think we don't need a boolean
flag. If the user has set hive.limit.pushdown.memory.usage=0, it is pretty clear that they want
the optimization disabled.
  ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java:1187 I didn't fully follow
the example here. What are the keys, values, and rows here?
  ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:154 select key from src
limit 0; implies the user doesn't want any data, so the RS doesn't need to emit anything; the
optimization is still valid and quite useful here, isn't it? So this check should be limit < 0
instead of limit <= 0.
  ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:157 Can you add a comment
explaining why you subtract limit * 64 in this threshold calculation?
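  For context, a minimal sketch of the calculation being questioned. The subtraction is my guess at the intent, not confirmed by the patch: limit * 64 would reserve roughly 64 bytes of bookkeeping overhead per retained row. The method name and parameters are illustrative, not the actual code:

```java
public class ThresholdSketch {
    // Hypothetical form of the threshold computation under review:
    // usable memory = max heap * hive.limit.pushdown.memory.usage,
    // minus an assumed ~64 bytes of overhead per row retained for top-N.
    static long flushThreshold(long maxMemory, float usagePercent, int limit) {
        return (long) (maxMemory * usagePercent) - (long) limit * 64;
    }

    public static void main(String[] args) {
        // 1 GiB heap, 50% budget, LIMIT 100 -> 536870912 - 6400
        System.out.println(flushThreshold(1L << 30, 0.5f, 100)); // prints 536864512
    }
}
```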
  ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:171 I think we need not
special-case limit = 0 by nulling the collector. As I said above, it is better to make RSHash
null in that case. This avoids a null check on the collector in processOp().
  ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:325 As I noted above,
let's not have this null check. Apart from your nulling of the collector above, could there ever
be a case where it is null? I don't think so; in that case let's remove this if-branch from the
inner loop.
  ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:340 Should this instead
be collect(keyWritable, value)?
  ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:343 I didn't get your comment
about the retry here. Can you expand the comment?
  ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:346 Should this be collect()
instead of out.collect()?
  ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:434-436 transient?
  ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:471 It would be good to
add a LOG.debug() here saying that ReduceSinkHash is being disabled.
  ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:477 I didn't get your
TODO here. Can you explain?
  ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:495 Would remove be a
better name?
  ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:498 Add a comment here
saying that values[index] could be null if flush() was called after this value was added but
before remove() was called.
  ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:535 You are using PriorityQueue
almost like a sorted set. I think java.util.TreeSet will suffice for what you need here.
  ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:566 I don't see you making
use of any feature of the Map interface. It looks like TreeSet would be sufficient here as well.

  If we can use TreeSet in both places, then we don't need these two different implementations
of ReduceSinkHash. That would simplify the code quite a bit.
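  To illustrate the suggestion, a minimal sketch of a bounded top-N buffer backed by a single TreeSet. TopNBuffer and its methods are illustrative names, not the patch's ReduceSinkHash API:

```java
import java.util.TreeSet;

// Illustrative sketch only, not the actual ReduceSinkHash code: keep the
// N smallest keys seen so far in one TreeSet, evicting the current worst
// (largest) key whenever a better one arrives.
public class TopNBuffer<K extends Comparable<K>> {
    private final int limit;
    private final TreeSet<K> keys = new TreeSet<>();

    public TopNBuffer(int limit) {
        this.limit = limit;
    }

    /** Offers a key; returns true if it survives the top-N cut. */
    public boolean add(K key) {
        if (keys.size() < limit) {
            return keys.add(key);
        }
        if (key.compareTo(keys.last()) < 0) {
            keys.pollLast();   // evict the current worst (largest) key
            return keys.add(key);
        }
        return false;          // key cannot be among the smallest N
    }

    public TreeSet<K> keys() {
        return keys;
    }

    public static void main(String[] args) {
        TopNBuffer<Integer> buf = new TopNBuffer<>(3);
        for (int k : new int[] {5, 1, 4, 2, 3}) {
            buf.add(k);
        }
        System.out.println(buf.keys());   // prints [1, 2, 3]
    }
}
```

  One caveat: a TreeSet silently drops duplicate keys, which matches the distinct case but not necessarily the ordered case with repeated keys; that difference may bear on whether one structure really covers both implementations.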
  ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:253-264 This handling
of the distinct-keys case can also be refactored to use ReduceSinkHash. It would be a good idea
to take this up in a follow-up jira.
  ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:535 This also needs to
be marked transient.
  ql/src/java/org/apache/hadoop/hive/ql/optimizer/LimitPushdownOptimizer.java:91 If we are
able to use one data structure in ReduceSinkHash (TreeSet or some other) for both cases, we
won't need this boolean.
  ql/src/java/org/apache/hadoop/hive/ql/optimizer/LimitPushdownOptimizer.java:46 How are you
enforcing the condition that this optimization applies only to the last RS? Can you add comments
about this?
  ql/src/test/queries/clientpositive/limit_pushdown.q:14 Can you also add tests for the following
queries with this optimization on:
  explain select distinct(key) from src limit 20;
  explain select count(distinct(key)) from src group by key limit 20;

  select distinct(key) from src limit 20;
  select count(distinct(key)) from src group by key limit 20;
  ql/src/test/queries/clientpositive/limit_pushdown.q:43 It would be good to separate the
queries where the optimization will not fire into a separate .q file.

REVISION DETAIL
  https://reviews.facebook.net/D5967

BRANCH
  HIVE-3562

ARCANIST PROJECT
  hive

To: JIRA, tarball, ashutoshc, navis
Cc: njain

                
> Some limit can be pushed down to map stage
> ------------------------------------------
>
>                 Key: HIVE-3562
>                 URL: https://issues.apache.org/jira/browse/HIVE-3562
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Navis
>            Assignee: Navis
>            Priority: Trivial
>         Attachments: HIVE-3562.D5967.1.patch, HIVE-3562.D5967.2.patch, HIVE-3562.D5967.3.patch,
HIVE-3562.D5967.4.patch, HIVE-3562.D5967.5.patch, HIVE-3562.D5967.6.patch
>
>
> A query with a limit clause (with a reasonable limit value), for example
> {noformat}
> select * from src order by key limit 10;
> {noformat}
> produces the operator tree
> TS-SEL-RS-EXT-LIMIT-FS
> But the LIMIT can be partially computed in the RS, reducing the size of the shuffle:
> TS-SEL-RS(TOP-N)-EXT-LIMIT-FS
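The shuffle reduction described in that issue can be simulated with a short runnable sketch. This is an illustration of the idea only, not Hive code: the real RS buffers rows and flushes batches, so the per-row "shuffled" count here is a simplification.

```java
import java.util.TreeSet;

// Illustration: with LIMIT n, a map task only needs to forward keys that
// could still be in the global top n; everything else is dropped before
// the shuffle.
public class MapSideLimitDemo {
    static int shuffledRowCount(int[] keys, int n) {
        TreeSet<Integer> topN = new TreeSet<>();
        int shuffled = 0;
        for (int key : keys) {
            if (topN.size() < n) {
                topN.add(key);
                shuffled++;                 // row must be forwarded
            } else if (key < topN.last()) {
                topN.pollLast();            // a real RS would buffer and
                topN.add(key);              // re-emit at flush time instead
                shuffled++;
            }
            // otherwise the row is dropped before the shuffle
        }
        return shuffled;
    }

    public static void main(String[] args) {
        int[] keys = new int[10_000];
        for (int i = 0; i < keys.length; i++) {
            keys[i] = i;                    // ascending input: best case
        }
        // Only the first 10 rows ever need to cross the shuffle boundary.
        System.out.println(shuffledRowCount(keys, 10)); // prints 10
    }
}
```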

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
