hive-dev mailing list archives

From "Hive QA (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-3286) Explicit skew join on user provided condition
Date Thu, 28 Nov 2013 01:20:36 GMT

    [ https://issues.apache.org/jira/browse/HIVE-3286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13834436#comment-13834436 ]

Hive QA commented on HIVE-3286:
-------------------------------



{color:red}Overall{color}: -1 no tests executed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12616156/HIVE-3286.13.patch.txt

Test results: http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/472/testReport
Console output: http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/472/console

Messages:
{noformat}
**** This message was trimmed, see log for full details ****
As a result, alternative(s) 2 were disabled for that input
warning(200): IdentifiersParser.g:68:4: 
Decision can match input such as "LPAREN KW_CASE TinyintLiteral" using multiple alternatives:
1, 2

As a result, alternative(s) 2 were disabled for that input
warning(200): IdentifiersParser.g:68:4: 
Decision can match input such as "LPAREN KW_NULL GREATERTHAN" using multiple alternatives:
1, 2

As a result, alternative(s) 2 were disabled for that input
warning(200): IdentifiersParser.g:68:4: 
Decision can match input such as "LPAREN KW_NOT DecimalLiteral" using multiple alternatives:
1, 2

As a result, alternative(s) 2 were disabled for that input
warning(200): IdentifiersParser.g:68:4: 
Decision can match input such as "LPAREN KW_CASE DecimalLiteral" using multiple alternatives:
1, 2

As a result, alternative(s) 2 were disabled for that input
warning(200): IdentifiersParser.g:108:5: 
Decision can match input such as "KW_ORDER KW_BY LPAREN" using multiple alternatives: 1, 2

As a result, alternative(s) 2 were disabled for that input
warning(200): IdentifiersParser.g:121:5: 
Decision can match input such as "KW_CLUSTER KW_BY LPAREN" using multiple alternatives: 1,
2

As a result, alternative(s) 2 were disabled for that input
warning(200): IdentifiersParser.g:133:5: 
Decision can match input such as "KW_PARTITION KW_BY LPAREN" using multiple alternatives:
1, 2

As a result, alternative(s) 2 were disabled for that input
warning(200): IdentifiersParser.g:144:5: 
Decision can match input such as "KW_DISTRIBUTE KW_BY LPAREN" using multiple alternatives:
1, 2

As a result, alternative(s) 2 were disabled for that input
warning(200): IdentifiersParser.g:155:5: 
Decision can match input such as "KW_SORT KW_BY LPAREN" using multiple alternatives: 1, 2

As a result, alternative(s) 2 were disabled for that input
warning(200): IdentifiersParser.g:172:7: 
Decision can match input such as "STAR" using multiple alternatives: 1, 2

As a result, alternative(s) 2 were disabled for that input
warning(200): IdentifiersParser.g:185:5: 
Decision can match input such as "KW_ARRAY" using multiple alternatives: 2, 6

As a result, alternative(s) 6 were disabled for that input
warning(200): IdentifiersParser.g:185:5: 
Decision can match input such as "KW_UNIONTYPE" using multiple alternatives: 5, 6

As a result, alternative(s) 6 were disabled for that input
warning(200): IdentifiersParser.g:185:5: 
Decision can match input such as "KW_STRUCT" using multiple alternatives: 4, 6

As a result, alternative(s) 6 were disabled for that input
warning(200): IdentifiersParser.g:267:5: 
Decision can match input such as "KW_NULL" using multiple alternatives: 1, 8

As a result, alternative(s) 8 were disabled for that input
warning(200): IdentifiersParser.g:267:5: 
Decision can match input such as "KW_FALSE" using multiple alternatives: 3, 8

As a result, alternative(s) 8 were disabled for that input
warning(200): IdentifiersParser.g:267:5: 
Decision can match input such as "KW_TRUE" using multiple alternatives: 3, 8

As a result, alternative(s) 8 were disabled for that input
warning(200): IdentifiersParser.g:267:5: 
Decision can match input such as "KW_DATE StringLiteral" using multiple alternatives: 2, 3

As a result, alternative(s) 3 were disabled for that input
warning(200): IdentifiersParser.g:399:5: 
Decision can match input such as "{KW_LIKE, KW_REGEXP, KW_RLIKE} KW_SORT KW_BY" using multiple
alternatives: 2, 9

As a result, alternative(s) 9 were disabled for that input
warning(200): IdentifiersParser.g:399:5: 
Decision can match input such as "{KW_LIKE, KW_REGEXP, KW_RLIKE} KW_GROUP KW_BY" using multiple
alternatives: 2, 9

As a result, alternative(s) 9 were disabled for that input
warning(200): IdentifiersParser.g:399:5: 
Decision can match input such as "{KW_LIKE, KW_REGEXP, KW_RLIKE} KW_INSERT KW_OVERWRITE" using
multiple alternatives: 2, 9

As a result, alternative(s) 9 were disabled for that input
warning(200): IdentifiersParser.g:399:5: 
Decision can match input such as "{KW_LIKE, KW_REGEXP, KW_RLIKE} KW_MAP LPAREN" using multiple
alternatives: 2, 9

As a result, alternative(s) 9 were disabled for that input
warning(200): IdentifiersParser.g:399:5: 
Decision can match input such as "{KW_LIKE, KW_REGEXP, KW_RLIKE} KW_ORDER KW_BY" using multiple
alternatives: 2, 9

As a result, alternative(s) 9 were disabled for that input
warning(200): IdentifiersParser.g:399:5: 
Decision can match input such as "KW_BETWEEN KW_MAP LPAREN" using multiple alternatives: 8,
9

As a result, alternative(s) 9 were disabled for that input
warning(200): IdentifiersParser.g:399:5: 
Decision can match input such as "{KW_LIKE, KW_REGEXP, KW_RLIKE} KW_LATERAL KW_VIEW" using
multiple alternatives: 2, 9

As a result, alternative(s) 9 were disabled for that input
warning(200): IdentifiersParser.g:399:5: 
Decision can match input such as "{KW_LIKE, KW_REGEXP, KW_RLIKE} KW_CLUSTER KW_BY" using multiple
alternatives: 2, 9

As a result, alternative(s) 9 were disabled for that input
warning(200): IdentifiersParser.g:399:5: 
Decision can match input such as "{KW_LIKE, KW_REGEXP, KW_RLIKE} KW_INSERT KW_INTO" using
multiple alternatives: 2, 9

As a result, alternative(s) 9 were disabled for that input
warning(200): IdentifiersParser.g:399:5: 
Decision can match input such as "{KW_LIKE, KW_REGEXP, KW_RLIKE} KW_DISTRIBUTE KW_BY" using
multiple alternatives: 2, 9

As a result, alternative(s) 9 were disabled for that input
warning(200): IdentifiersParser.g:524:5: 
Decision can match input such as "{AMPERSAND..BITWISEXOR, DIV..DIVIDE, EQUAL..EQUAL_NS, GREATERTHAN..GREATERTHANOREQUALTO,
KW_AND, KW_ARRAY, KW_BETWEEN..KW_BOOLEAN, KW_CASE, KW_DOUBLE, KW_FLOAT, KW_IF, KW_IN, KW_INT,
KW_LIKE, KW_MAP, KW_NOT, KW_OR, KW_REGEXP, KW_RLIKE, KW_SMALLINT, KW_STRING..KW_STRUCT, KW_TINYINT,
KW_UNIONTYPE, KW_WHEN, LESSTHAN..LESSTHANOREQUALTO, MINUS..NOTEQUAL, PLUS, STAR, TILDE}" using
multiple alternatives: 1, 3

As a result, alternative(s) 3 were disabled for that input
[INFO] 
[INFO] --- maven-resources-plugin:2.5:resources (default-resources) @ hive-exec ---
[debug] execute contextualize
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] Copying 1 resource
[INFO] 
[INFO] --- maven-antrun-plugin:1.7:run (define-classpath) @ hive-exec ---
[INFO] Executing tasks

main:
[INFO] Executed tasks
[INFO] 
[INFO] --- maven-compiler-plugin:3.1:compile (default-compile) @ hive-exec ---
[INFO] Compiling 1400 source files to /data/hive-ptest/working/apache-svn-trunk-source/ql/target/classes
[INFO] -------------------------------------------------------------
[WARNING] COMPILATION WARNING : 
[INFO] -------------------------------------------------------------
[WARNING] Note: Some input files use or override a deprecated API.
[WARNING] Note: Recompile with -Xlint:deprecation for details.
[WARNING] Note: Some input files use unchecked or unsafe operations.
[WARNING] Note: Recompile with -Xlint:unchecked for details.
[INFO] 4 warnings 
[INFO] -------------------------------------------------------------
[INFO] -------------------------------------------------------------
[ERROR] COMPILATION ERROR : 
[INFO] -------------------------------------------------------------
[ERROR] /data/hive-ptest/working/apache-svn-trunk-source/ql/src/java/org/apache/hadoop/hive/ql/plan/SkewContext.java:[118,49]
cannot find symbol
symbol  : method getRandom()
location: class org.apache.hadoop.hive.ql.io.HiveKey
[INFO] 1 error
[INFO] -------------------------------------------------------------
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO] 
[INFO] Hive .............................................. SUCCESS [2.846s]
[INFO] Hive Ant Utilities ................................ SUCCESS [9.072s]
[INFO] Hive Shims Common ................................. SUCCESS [3.459s]
[INFO] Hive Shims 0.20 ................................... SUCCESS [2.210s]
[INFO] Hive Shims Secure Common .......................... SUCCESS [2.711s]
[INFO] Hive Shims 0.20S .................................. SUCCESS [1.397s]
[INFO] Hive Shims 0.23 ................................... SUCCESS [3.679s]
[INFO] Hive Shims ........................................ SUCCESS [3.073s]
[INFO] Hive Common ....................................... SUCCESS [13.388s]
[INFO] Hive Serde ........................................ SUCCESS [11.713s]
[INFO] Hive Metastore .................................... SUCCESS [26.188s]
[INFO] Hive Query Language ............................... FAILURE [30.091s]
[INFO] Hive Service ...................................... SKIPPED
[INFO] Hive JDBC ......................................... SKIPPED
[INFO] Hive Beeline ...................................... SKIPPED
[INFO] Hive CLI .......................................... SKIPPED
[INFO] Hive Contrib ...................................... SKIPPED
[INFO] Hive HBase Handler ................................ SKIPPED
[INFO] Hive HCatalog ..................................... SKIPPED
[INFO] Hive HCatalog Core ................................ SKIPPED
[INFO] Hive HCatalog Pig Adapter ......................... SKIPPED
[INFO] Hive HCatalog Server Extensions ................... SKIPPED
[INFO] Hive HCatalog Webhcat Java Client ................. SKIPPED
[INFO] Hive HCatalog Webhcat ............................. SKIPPED
[INFO] Hive HCatalog HBase Storage Handler ............... SKIPPED
[INFO] Hive HWI .......................................... SKIPPED
[INFO] Hive ODBC ......................................... SKIPPED
[INFO] Hive Shims Aggregator ............................. SKIPPED
[INFO] Hive TestUtils .................................... SKIPPED
[INFO] Hive Packaging .................................... SKIPPED
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 1:52.444s
[INFO] Finished at: Wed Nov 27 20:19:46 EST 2013
[INFO] Final Memory: 51M/371M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.1:compile
(default-compile) on project hive-exec: Compilation failure
[ERROR] /data/hive-ptest/working/apache-svn-trunk-source/ql/src/java/org/apache/hadoop/hive/ql/plan/SkewContext.java:[118,49]
cannot find symbol
[ERROR] symbol  : method getRandom()
[ERROR] location: class org.apache.hadoop.hive.ql.io.HiveKey
[ERROR] -> [Help 1]
[ERROR] 
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR] 
[ERROR] For more information about the errors and possible solutions, please read the following
articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
[ERROR] 
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR]   mvn <goals> -rf :hive-exec
+ exit 1
'
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12616156

> Explicit skew join on user provided condition
> ---------------------------------------------
>
>                 Key: HIVE-3286
>                 URL: https://issues.apache.org/jira/browse/HIVE-3286
>             Project: Hive
>          Issue Type: Improvement
>          Components: Query Processor
>            Reporter: Navis
>            Assignee: Navis
>            Priority: Minor
>         Attachments: D4287.11.patch, HIVE-3286.12.patch.txt, HIVE-3286.13.patch.txt,
HIVE-3286.D4287.10.patch, HIVE-3286.D4287.5.patch, HIVE-3286.D4287.6.patch, HIVE-3286.D4287.7.patch,
HIVE-3286.D4287.8.patch, HIVE-3286.D4287.9.patch
>
>
> Join operations on tables with skewed data spend most of their execution time handling the skewed keys. In most cases, however, we already know about the skew and even know what the skewed keys look like.
> If we can explicitly assign reducer slots to the skewed keys, total execution time could be greatly shortened.
> As a start, I've extended the join grammar along these lines:
> {code}
> select * from src a join src b on a.key=b.key skew on (a.key+1 < 50, a.key+1 < 100, a.key < 150);
> {code}
> which means that if the above query is executed by 20 reducers, one reducer is used for a.key+1 < 50, one reducer for 50 <= a.key+1 < 100, one reducer for 99 <= a.key < 150, and 17 reducers for the rest (this could be extended later to assign more than one reducer per group).
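
The ranges listed above follow from evaluating the conditions in order and taking the first one that holds. A minimal Python sketch of that reading, assuming integer keys; `PREDICATES` and `skew_group` are hypothetical names used only for illustration and are not part of the patch:

```python
from typing import Optional

# Conditions from "skew on (a.key+1 < 50, a.key+1 < 100, a.key < 150)",
# checked in the order they are written. Integer keys are an assumption.
PREDICATES = [
    lambda key: key + 1 < 50,
    lambda key: key + 1 < 100,
    lambda key: key < 150,
]


def skew_group(key: int) -> Optional[int]:
    """Return the index of the first condition that holds, or None."""
    for index, predicate in enumerate(PREDICATES):
        if predicate(key):
            return index
    return None  # row goes to the remaining (non-skew) reducers


if __name__ == "__main__":
    for key in (48, 49, 98, 99, 149, 150):
        print(key, skew_group(key))
```

Keys 49 through 98 fall into the second group and keys 99 through 149 into the third, which is why the third range starts at 99 rather than 100.
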
> This can only be used with common inner equi-joins, and the skew condition should be composed of join keys only.
> The work done so far will be posted shortly after code cleanup.
> ----------------------------
> Skew expressions* in "SKEW ON (expr, expr, ...)" are evaluated sequentially at runtime, and the first one that evaluates to 'true' decides the skew group for the row. Each skew group has reserved partition slot(s), to which all rows in the group are assigned.
> The number of partition slots reserved for each group is also decided at runtime, by a simple percentage calculation. If a skew group is "CLUSTER BY 20 PERCENT" and the total number of partition slots (= number of reducers) is 20, that group will reserve 4 partition slots, and so on.
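
The percentage calculation can be sketched as follows. This is a hedged illustration only: `reserved_slots` is a hypothetical helper, and the rounding rule (round down, keep at least one slot) is an assumption, since the description gives only a case that divides evenly.

```python
def reserved_slots(percent: int, total_slots: int) -> int:
    """Slots reserved for a group declared as CLUSTER BY <percent> PERCENT."""
    # Assumption: round down, but never reserve fewer than one slot.
    return max(1, total_slots * percent // 100)


if __name__ == "__main__":
    # 20 PERCENT of 20 reducers reserves 4 slots, as in the description.
    print(reserved_slots(20, 20))
```
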
> "DISTRIBUTE BY" decides how the rows in a group are dispersed across the range of reserved slots (if there is only one slot for a group, this is meaningless). Currently, three distribution policies are available: RANDOM, KEYS, <expression>.
> 1. RANDOM : rows of the driver** alias are dispersed randomly and rows of non-driver aliases are duplicated to all the slots (default if not specified)
> 2. KEYS : determined by the hash value of the keys (same as before)
> 3. expression : determined by the hash of the object evaluated by the user-provided expression
> Only possible with inner, equi, common joins. Join tree merging is not yet supported.
> Might be used by other RS users like "SORT BY" or "GROUP BY"
> If column statistics exist for the key, it could be possible to apply this automatically.
> For example, if 20 reducers are used for the query below,
> {code}
> select count(*) from src a join src b on a.key=b.key skew on (
>    a.key = '0' CLUSTER BY 10 PERCENT,
>    b.key < '100' CLUSTER BY 20 PERCENT DISTRIBUTE BY upper(b.key),
>    cast(a.key as int) > 300 CLUSTER BY 40 PERCENT DISTRIBUTE BY KEYS);
> {code}
> group-0 will reserve slots 6~7, group-1 8~11, group-2 12~19, and the others will reserve slots 0~5.
> For a row with key='0' from alias a, the row is randomly assigned in the range 6~7 (driver alias) : 6 or 7
> For a row with key='0' from alias b, the row is distributed to all slots in 6~7 (non-driver alias) : 6 and 7
> For a row with key='50', the row is assigned in the range 8~11 by the hashcode of upper(b.key) : 8 + (hash(upper(key)) % 4)
> For a row with key='500', the row is assigned in the range 12~19 by the hashcode of the join key : 12 + (hash(key) % 8)
> A row with key='200' does not belong to any skew group : hash(key) % 6
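
The slot arithmetic in this example can be reproduced with a short Python sketch. It is an illustration under stated assumptions, not the patch's code: the layout rule (non-skew rows keep the lowest slots, skew groups follow in declaration order) is inferred from the numbers above, `layout`, `stable_hash` and `slot_by_hash` are hypothetical names, and `zlib.crc32` stands in for whatever hash Hive actually applies.

```python
import zlib


def layout(total_slots, percents):
    """Return (slots for non-skew rows, list of slot ranges per skew group)."""
    sizes = [total_slots * p // 100 for p in percents]
    first_skew_slot = total_slots - sum(sizes)
    groups = []
    start = first_skew_slot
    for size in sizes:
        groups.append(range(start, start + size))
        start += size
    return range(0, first_skew_slot), groups


def stable_hash(value: str) -> int:
    # Stand-in hash for illustration; Hive's real hash function differs.
    return zlib.crc32(value.encode("utf-8"))


def slot_by_hash(slots: range, value: str) -> int:
    """First slot of the range plus hash modulo the number of slots."""
    return slots.start + stable_hash(value) % len(slots)


if __name__ == "__main__":
    others, groups = layout(20, [10, 20, 40])
    print(others, groups)
    print(slot_by_hash(groups[1], "50".upper()))  # 8 + (hash % 4)
    print(slot_by_hash(groups[2], "500"))         # 12 + (hash % 8)
    print(slot_by_hash(others, "200"))            # hash % 6
```

With 20 slots and groups of 10, 20 and 40 percent, the sketch yields the ranges 0~5, 6~7, 8~11 and 12~19 quoted above.
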
> *expressions in skew condition :
> 1. all expressions should be made of expressions in the join condition, which means that if the join condition is "a.key=b.key", the user can make any expression with "a.key" or "b.key". But if the join condition is a.key+1=b.key, the user cannot make an expression with "a.key" alone (it should be an expression with "a.key+1").
> 2. all expressions should reference one and only one side of the aliases. For example, simple constant expressions or expressions referencing both sides of the join condition ("a.key+b.key<100") are not allowed.
> 3. all functions in an expression should be deterministic and stateless.
> 4. if "DISTRIBUTE BY expression" is used, the distribution expression should also have the same alias as the skew expression.
> **driver alias :
> 1. driver alias means the sole alias referenced from the skew expression, which is important for RANDOM distribution. Rows of the driver alias are assigned to a single slot randomly, but rows of non-driver aliases are duplicated to all the slots. So, the driver alias should be the biggest one among the join aliases.
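
The RANDOM policy for driver and non-driver aliases can be sketched as below. This is a hedged illustration rather than the patch's implementation; `route` is a hypothetical name and the slot range 6~7 is taken from the example query.

```python
import random


def route(is_driver: bool, slots: range, rng: random.Random) -> list:
    """Slots a row is sent to under the RANDOM distribution policy."""
    if is_driver:
        # A driver-alias row goes to exactly one randomly chosen slot.
        return [rng.choice(slots)]
    # A non-driver row is copied to every slot reserved for the group.
    return list(slots)


if __name__ == "__main__":
    rng = random.Random(0)
    print(route(True, range(6, 8), rng))   # a single slot, 6 or 7
    print(route(False, range(6, 8), rng))  # both slots
```

Because every reserved slot holds a full copy of the non-driver rows, a driver row finds its matches in whichever slot it lands. Only the non-driver side pays the duplication cost, which is why the largest alias should be the driver.
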



--
This message was sent by Atlassian JIRA
(v6.1#6144)
