hive-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Phabricator (JIRA)" <>
Subject [jira] [Commented] (HIVE-3562) Some limit can be pushed down to map stage
Date Sat, 24 Aug 2013 03:57:52 GMT


Phabricator commented on HIVE-3562:

navis has commented on the revision "HIVE-3562 [jira] Some limit can be pushed down to map

  Thanks for a review.

  common/src/java/org/apache/hadoop/hive/conf/ I thought it's a convention
of hive to make configurations for enable/disable and threshold each, which was requested
several times till now for me for other issues. I'll address that.
  ql/src/java/org/apache/hadoop/hive/ql/exec/ Legend : A(a) -->
key A, value a, row A(a)

  If each mapper takes rows lite this
  MAP1(RS) : 40(a)-10(b)-30(c)-10(d)-70(e)-80(f)
  MAP2(RS) : 90(g)-80(h)-60(i)-40(j)-30(k)-20(l)
  MAP3(RS) : 40(m)-50(n)-30(o)-30(p)-60(q)-70(r)

  REDUCER : 10(b,d)-20(l)-30(c,k,o)-40(a,j,m)-50(n)-60(i,q)-70(e,r)-80(f,h)-90(g)
  REDUCER(LIMIT 3) : 10(b,d)-20(l)-30(c,k,o)

  with this patch

  MAP1 : 40(a)-10(b)-30(c)-10(d)
  MAP2 : 40(j)-30(k)-20(l)
  MAP3 : 40(m)-50(n)-30(o)-30(p)

  REDUCER : 10(b,d)-20(l)-30(c,k,o)-40(a,j,m)-50(n)
  REDUCER(LIMIT 3) : 10(b,d)-20(l)-30(c,k,o)
  ql/src/java/org/apache/hadoop/hive/ql/exec/ I thought limit 0
is not realistic and can be used for some kind of testing. I'll address that case but making
Hash seemed a little much.
  ql/src/java/org/apache/hadoop/hive/ql/exec/ ReduceSinkHash makes
arrays for key/value/hashes. It's for compensating those (not exact but I like the number).
  ql/src/java/org/apache/hadoop/hive/ql/exec/ I think there was
a test for RS operator which has no collector. I'll check that.
  ql/src/java/org/apache/hadoop/hive/ql/exec/ Right. I've missed
  ql/src/java/org/apache/hadoop/hive/ql/exec/ It's possible to
add key/value to hash which is flushed right before. But I just forwarded it.
  ql/src/java/org/apache/hadoop/hive/ql/exec/ Right.
  ql/src/java/org/apache/hadoop/hive/ql/exec/ It's not serializable
class. But I'll add transient mod.
  ql/src/java/org/apache/hadoop/hive/ql/exec/ I'll leave log message
in the processOp() method.
  ql/src/java/org/apache/hadoop/hive/ql/exec/ Return value can
be false only if it's for GBY. False means the key exists already in RS hash.
  For case A(a)->A(b)->B(c) with limit 5, the first A(a) will be stored into hash. For
the second A(b), value b should be stored, which means values should be byte[][][] using additional
array for each index consuming more memory. But I suspected that case could be a common case
with map aggregation. So I just decided to forward A(b) to output collector.
  ql/src/java/org/apache/hadoop/hive/ql/exec/ It does not remove
anything. This is called after index is removed from RS hash.
  ql/src/java/org/apache/hadoop/hive/ql/exec/ Good.
  ql/src/java/org/apache/hadoop/hive/ql/exec/ Main difference of
HashForOrder from HashForGroupBy is that it should store duplicated keys. As described in
comments, A->B->B->C limit 3 should store A,B,B not A,B,C. I believe this behavior
can be possibly implemented with simple TreeSet but thought this is way simpler.
  ql/src/java/org/apache/hadoop/hive/ql/exec/ I prefer Map rather
than Set but I'll change that.
  ql/src/test/queries/clientpositive/limit_pushdown.q:14 ok
  ql/src/test/queries/clientpositive/limit_pushdown.q:43 ok.
  ql/src/java/org/apache/hadoop/hive/ql/optimizer/ This is not
exact description. RS for limit (not last RS). I'll change this, too.
  ql/src/java/org/apache/hadoop/hive/ql/exec/ Looks promising.
Need more investigation on implementation of distinct.




To: JIRA, tarball, ashutoshc, navis
Cc: njain

> Some limit can be pushed down to map stage
> ------------------------------------------
>                 Key: HIVE-3562
>                 URL:
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Navis
>            Assignee: Navis
>            Priority: Trivial
>         Attachments: HIVE-3562.D5967.1.patch, HIVE-3562.D5967.2.patch, HIVE-3562.D5967.3.patch,
HIVE-3562.D5967.4.patch, HIVE-3562.D5967.5.patch, HIVE-3562.D5967.6.patch
> Queries with limit clause (with reasonable number), for example
> {noformat}
> select * from src order by key limit 10;
> {noformat}
> makes operator tree, 
> But LIMIT can be partially calculated in RS, reducing size of shuffling.

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see:

View raw message