hive-dev mailing list archives

From "Utkarsh Raj Goswami (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HIVE-10599) Identify and execute queries in parallel under "hive -f" mode
Date Mon, 04 May 2015 20:15:06 GMT
Utkarsh Raj Goswami created HIVE-10599:
------------------------------------------

             Summary: Identify and execute queries in parallel under "hive -f" mode
                 Key: HIVE-10599
                 URL: https://issues.apache.org/jira/browse/HIVE-10599
             Project: Hive
          Issue Type: New Feature
          Components: Query Planning, Query Processor
            Reporter: Utkarsh Raj Goswami
            Priority: Minor


Currently, Hive identifies the jobs (Spark or MapReduce) that make up a single query and,
when hive.exec.parallel=true, executes the independent ones in parallel.

It would be very useful if Hive supported similar parallelism at the file level, i.e. for a
file containing a list of queries. If the system could handle such parallelism, it would
drastically reduce the total runtime of the file.

For example, suppose I have the following queries in a file:
1) USE my_db;
2) ADD JAR /path/to/my/jar;
3) CREATE TEMPORARY FUNCTION UDFFoo AS 'org.urg.MyUDF';

4) CREATE TABLE IF NOT EXISTS tab1 (col1 STRING, col2 STRING);
5) INSERT OVERWRITE TABLE tab1 SELECT col1, UDFFoo(col2) FROM preExistingTableInDB;

6) CREATE TABLE IF NOT EXISTS tab2(col1 STRING, col2 STRING);
7) INSERT OVERWRITE TABLE tab2 SELECT col1, col2 FROM preExistingTableInDB2;

8) SET some.query.specific.property1=some.value;
9) CREATE TABLE IF NOT EXISTS tab3(col1 STRING);
10) INSERT INTO TABLE tab3 SELECT col1 FROM preExistingTableInDB;

11) SET some.query.specific.property2=some.value;
12) CREATE TABLE IF NOT EXISTS tab4(col1 STRING, col2 STRING);
13) INSERT OVERWRITE TABLE tab4 SELECT B.col1,UDFFoo(B.col2) FROM tab1 A JOIN tab2 B ON (A.col1=B.col1);


If the file is analysed, parallelism can be achieved with the following execution levels:

LEVEL-1 : Execute(1,2,3,4), Execute(1,2,3,6), Execute(1,2,3,8,9), Execute(1,2,3,8,11,12)
LEVEL-2 : Execute(1,2,3,5), Execute(1,2,3,7), Execute(1,2,3,8,10)
LEVEL-3 : Execute(1,2,3,8,11,13)

Within a level, all the *Execute* expressions can be executed in parallel; a level starts
only after the previous level has finished.
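To make the *Execute* notation concrete, here is a minimal Python sketch. It is not part of Hive and not a proposed implementation. It assumes each Execute group is the list of session-level statements (USE, ADD JAR, CREATE TEMPORARY FUNCTION, SET) that precede a statement in the file, followed by the statement itself. The names SESSION, LEVELS, execute_group and run_levels are hypothetical, and the runner is a placeholder that a real tool would replace with something that submits the group to Hive.

```python
from concurrent.futures import ThreadPoolExecutor

# Statement numbers from the example file (hand-copied, hypothetical names).
SESSION = [1, 2, 3, 8, 11]                  # USE, ADD JAR, CREATE TEMPORARY FUNCTION, SET, SET
LEVELS = [[4, 6, 9, 12], [5, 7, 10], [13]]  # levels as listed in the issue


def execute_group(stmt):
    """Session statements that precede `stmt` in the file, then `stmt` itself."""
    return [s for s in SESSION if s < stmt] + [stmt]


def run_levels(levels, runner, max_workers=4):
    """Run each level's groups in parallel, one level after another."""
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for level in levels:
            groups = [execute_group(s) for s in level]
            # list() waits for every group, so the next level starts
            # only after the current one has finished.
            results.append(list(pool.map(runner, groups)))
    return results


if __name__ == "__main__":
    # Placeholder runner: just echoes the group instead of calling Hive.
    for i, groups in enumerate(run_levels(LEVELS, runner=lambda g: g), start=1):
        print(f"LEVEL-{i}:", groups)
```

With the placeholder runner, the printed groups correspond to the Execute expressions listed for each level above.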

These parallel execution levels can be identified by analyzing the dependencies between the
queries/tables.
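
As an illustration of that dependency analysis, the following Python sketch derives the levels for the example file. It rests on several assumptions: the tables each statement reads and writes are written out by hand here (a real implementation would take them from the query plan), session-level statements are left out, and a statement is taken to depend only on the most recent earlier statement that wrote a table it touches. Write-after-read conflicts are ignored, which is sufficient for this example but not in general. STATEMENTS and execution_levels are hypothetical names.

```python
from collections import defaultdict

# statement number -> (tables written, tables read), hand-written for the example
STATEMENTS = {
    4: ({"tab1"}, set()),
    5: ({"tab1"}, {"preExistingTableInDB"}),
    6: ({"tab2"}, set()),
    7: ({"tab2"}, {"preExistingTableInDB2"}),
    9: ({"tab3"}, set()),
    10: ({"tab3"}, {"preExistingTableInDB"}),
    12: ({"tab4"}, set()),
    13: ({"tab4"}, {"tab1", "tab2"}),
}


def execution_levels(statements):
    """Group statements into levels; a statement's level is one more than
    the highest level among the statements it depends on."""
    last_writer = {}   # table -> number of the latest statement that wrote it
    level_of = {}      # statement number -> assigned level
    levels = defaultdict(list)
    for num in sorted(statements):          # file order
        writes, reads = statements[num]
        deps = {last_writer[t] for t in writes | reads if t in last_writer}
        level = 1 + max((level_of[d] for d in deps), default=0)
        level_of[num] = level
        levels[level].append(num)
        for table in writes:
            last_writer[table] = num
    return dict(levels)


if __name__ == "__main__":
    for level, nums in sorted(execution_levels(STATEMENTS).items()):
        print(f"LEVEL-{level}: {nums}")
```

For the example file this places statements 4, 6, 9 and 12 in the first level, 5, 7 and 10 in the second, and 13 in the third, which agrees with the levels listed above.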



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
