hive-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Phabricator (JIRA)" <>
Subject [jira] [Commented] (HIVE-3562) Some limit can be pushed down to map stage
Date Thu, 22 Aug 2013 16:20:54 GMT


Phabricator commented on HIVE-3562:

ashutoshc has requested changes to the revision "HIVE-3562 [jira] Some limit can be pushed
down to map stage".

  common/src/java/org/apache/hadoop/hive/conf/ I think we don't need a boolean
flag. If user has set hive.limit.pushdown.memory.usage=0 it is pretty clear he wants the optimization
to be disabled.
  ql/src/java/org/apache/hadoop/hive/ql/exec/ I didnt fully follow
the example here. What are the keys/values/rows here?
  ql/src/java/org/apache/hadoop/hive/ql/exec/ select key from src
limit 0; implies user doesn't want any data, so RS doesnt need to emit any thing, this optimization
is still valid and super useful here. Isn't it? So, this should be limit < 0, instead limit
  ql/src/java/org/apache/hadoop/hive/ql/exec/ Can you add a comment
why you are subtracting limit * 64 in this calculation of threshold?
  ql/src/java/org/apache/hadoop/hive/ql/exec/ I think we need not
to special handle limit = 0 by null'ing the collector. As I said above better is to make RSHash
to null in that case. This avoids a null check of collector in processOp().
  ql/src/java/org/apache/hadoop/hive/ql/exec/ As I noted above,
lets not have this null check. Apart from you null'ing the collector above, could there ever
be case when it will be null. I dont think so, in that case lets remove this if-branch from
inner loop.
  ql/src/java/org/apache/hadoop/hive/ql/exec/ Shall this be instead
collect(keyWritable, value) ?
  ql/src/java/org/apache/hadoop/hive/ql/exec/ Didn't get your comment
for retry here. Can you add more comment for it?
  ql/src/java/org/apache/hadoop/hive/ql/exec/ collect() instead
of out.collect() ?
  ql/src/java/org/apache/hadoop/hive/ql/exec/ transient?
  ql/src/java/org/apache/hadoop/hive/ql/exec/ It will be good to
add a LOG.DEBUG() here saying, disabling ReduceSinkHash.
  ql/src/java/org/apache/hadoop/hive/ql/exec/ I didn't get your
TODO here. Can you explain?
  ql/src/java/org/apache/hadoop/hive/ql/exec/ Better name: remove
  ql/src/java/org/apache/hadoop/hive/ql/exec/ Add a comment here
saying values[index] could be null, if flush() was called after this value is added but before
remove() is called.
  ql/src/java/org/apache/hadoop/hive/ql/exec/ You are using PriorityQueue
almost like a sorted set.  I think java.util.TreeSet will suffice what you need here.
  ql/src/java/org/apache/hadoop/hive/ql/exec/ I dont see you making
use of any feature corresponding to Map interface. Looks like TreeSet would be sufficient
here as well.

  If it so that at both places we can use TreeSet than we do need not these two diff implementation
of ReduceSinkHash. This will simplify code quite a bit.
  ql/src/java/org/apache/hadoop/hive/ql/exec/ This handling
of distinct keys case can also be refactored to use ReduceSinkHash. Will be a good idea to
take this up in a follow up jira.
  ql/src/java/org/apache/hadoop/hive/ql/exec/ Also need to mark
it transient.
  ql/src/java/org/apache/hadoop/hive/ql/optimizer/ If we are
able to use one DataStructure in ReduceSinkHash (either TreeSet or some other) for both cases,
we wont need this boolean.
  ql/src/java/org/apache/hadoop/hive/ql/optimizer/ How are enforcing
this condition of applying this optimization only for last RS. Can you add comments about
  ql/src/test/queries/clientpositive/limit_pushdown.q:14 Can you also add a test for following
query with this optimization on:
  explain select distinct(key) from src limit 20;
  explain select count(distinct(key)) from src group by key limit 20;

  select distinct(key) from src limit 20;
  select count(distinct(key)) from src group by key limit 20;
  ql/src/test/queries/clientpositive/limit_pushdown.q:43 It will be good to seprate these
queries where optimization will not be fired in a seprate .q file.




To: JIRA, tarball, ashutoshc, navis
Cc: njain

> Some limit can be pushed down to map stage
> ------------------------------------------
>                 Key: HIVE-3562
>                 URL:
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Navis
>            Assignee: Navis
>            Priority: Trivial
>         Attachments: HIVE-3562.D5967.1.patch, HIVE-3562.D5967.2.patch, HIVE-3562.D5967.3.patch,
HIVE-3562.D5967.4.patch, HIVE-3562.D5967.5.patch, HIVE-3562.D5967.6.patch
> Queries with limit clause (with reasonable number), for example
> {noformat}
> select * from src order by key limit 10;
> {noformat}
> makes operator tree, 
> But LIMIT can be partially calculated in RS, reducing size of shuffling.

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see:

View raw message