hive-user mailing list archives

From Sungwoo Park <glap...@gmail.com>
Subject Fwd: Hive generating different DAGs from the same query
Date Fri, 20 Jul 2018 04:03:14 GMT
Hello Zoltan,

I tested further and found no exception (such as
MapJoinMemoryExhaustionError) during the run, so the query ran fine. My
conclusion is that a query can update some internal state of HiveServer2,
affecting DAG generation for subsequent queries. Moreover, the same query
may or may not affect DAG generation.

This issue is not related to query reexecution, as even with query
reexecution disabled (hive.query.reexecution.enabled set to false), I still
see this problem occurring.
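
For anyone who wants to repeat the log check from the quoted message below,
here is a minimal sketch of how it could be scripted. The function name
count_reexec_hits is hypothetical; the four patterns and the default log
path are the ones given in the quoted message, and the path may differ on
other installations.

```shell
#!/bin/sh
# Sketch only: count non-DEBUG log lines that match the reexecution-related
# patterns from the quoted message. The function name is hypothetical.

count_reexec_hits() {
  # Default path is the one used in the thread; pass another path as $1.
  log_file="${1:-/var/log/hive/hiveserver2.log}"

  # -F treats each -e pattern as a fixed string, not a regular expression.
  # grep -v drops DEBUG lines; wc -l counts what is left.
  # tr strips the padding that some wc implementations add.
  grep -F \
    -e 'org.apache.hadoop.hive.ql.exec.mapjoin.MapJoinMemoryExhaustionError' \
    -e 'reexec' \
    -e 'Driver.java:execute' \
    -e 'SessionState.java:printError' \
    "$log_file" | grep -v 'DEBUG' | wc -l | tr -d ' '
}
```

Calling count_reexec_hits with no argument reads the default log path. A
count of 0 only says that none of these patterns appear in that file; it
does not by itself explain why the DAGs differ.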

--- Sungwoo Park

On Fri, Jul 13, 2018 at 4:48 PM, Zoltan Haindrich <kirk@rxd.hu> wrote:

> Hello Sungwoo!
>
> I think it's possible that reoptimization is kicking in because the first
> execution bumped into an exception.
>
> I think the plans should not be changing permanently, unless
> "hive.query.reexecution.stats.persist.scope" is set to a wider scope than
> query.
>
> To check whether reoptimization is indeed happening, look for:
>
> cat > patterns << EOF
> org.apache.hadoop.hive.ql.exec.mapjoin.MapJoinMemoryExhaustionError
> reexec
> Driver.java:execute
> SessionState.java:printError
> EOF
>
> cat patterns
>
> fgrep -Ff patterns --color=yes /var/log/hive/hiveserver2.log | grep -v DEBUG
>
> cheers,
> Zoltan
>
>
> On 07/11/2018 10:40 AM, Sungwoo Park wrote:
>
>> Hello,
>>
>> I am running the TPC-DS benchmark using Hive 3.0, and I find that Hive
>> sometimes produces different DAGs from the same query. These are the two
>> scenarios for the experiment. The execution engine is Tez, and the TPC-DS
>> scale factor is 3TB.
>>
>> 1. Run query 19 to query 24 sequentially in the same session. The first
>> part of query 24 takes about 156 seconds:
>>
>> 100 rows selected (58.641 seconds) <-- query 19
>> 100 rows selected (16.117 seconds)
>> 100 rows selected (9.841 seconds)
>> 100 rows selected (35.195 seconds)
>> 1 row selected (258.441 seconds)
>> 59 rows selected (213.156 seconds)
>> 4,643 rows selected (156.982 seconds) <-- the first part of query 24
>> 1,656 rows selected (136.382 seconds)
>>
>> 2. Now run query 1 to query 24 sequentially in the same session. This
>> time the first part of query 24 takes more than 1000 seconds:
>>
>> 100 rows selected (94.981 seconds) <-- query 1
>> 2,513 rows selected (30.804 seconds)
>> 100 rows selected (11.076 seconds)
>> 100 rows selected (225.646 seconds)
>> 100 rows selected (44.186 seconds)
>> 52 rows selected (11.436 seconds)
>> 100 rows selected (21.968 seconds)
>> 11 rows selected (14.05 seconds)
>> 1 row selected (35.619 seconds)
>> 100 rows selected (27.062 seconds)
>> 100 rows selected (134.098 seconds)
>> 100 rows selected (7.65 seconds)
>> 1 row selected (14.54 seconds)
>> 100 rows selected (143.965 seconds)
>> 100 rows selected (101.676 seconds)
>> 100 rows selected (19.742 seconds)
>> 1 row selected (245.381 seconds)
>> 100 rows selected (71.617 seconds)
>> 100 rows selected (23.017 seconds)
>> 100 rows selected (10.888 seconds)
>> 100 rows selected (11.149 seconds)
>> 100 rows selected (7.919 seconds)
>> 100 rows selected (29.527 seconds)
>> 1 row selected (220.516 seconds)
>> 59 rows selected (204.363 seconds)
>> 4,643 rows selected (1008.514 seconds) <-- the first part of query 24
>> 1,656 rows selected (141.279 seconds)
>>
>> Here are a few findings from the experiment:
>>
>> 1. The two DAGs for the first part of query 24 are similar but not
>> identical. The DAG from the first scenario contains 17 vertices, whereas
>> the DAG from the second scenario contains 18 vertices and skips part of
>> the map-side join that is performed in the first scenario.
>>
>> 2. The configuration (HiveConf) inside HiveServer2 is precisely the same
>> before running the first part of query 24 (except for minor keys).
>>
>> So, I wonder how Hive can produce different DAGs from the same query. For
>> example, is there some internal configuration key in HiveConf that
>> enables/disables some optimization depending on the accumulated statistics
>> in HiveServer2? (I haven't tested it yet, but I can also test with Hive
>> 2.x.)
>>
>> Thank you in advance,
>>
>> --- Sungwoo Park
>>
>>
