hive-user mailing list archives

From Gopal Vijayaraghavan <gop...@apache.org>
Subject Re: Clustering and Large-scale analysis of Hive Queries
Date Fri, 03 Aug 2018 18:40:43 GMT

> I am interested in working on a project that takes a large number of Hive queries (as
> well as their meta data like amount of resources used etc) and find out common sub queries
> and expensive query groups etc.

This was roughly the central research topic of one of the Hive CBO devs, except that it was
implemented for Pig (not Hive).

https://hal.inria.fr/hal-01353891
+
https://github.com/jcamachor/pigreuse

I think there's a lot of interest in this topic for ETL workloads and the goal is to pick
this up as ETL becomes the target problem.

There's a recent SIGMOD paper which talks about the same sort of reuse.

https://www.microsoft.com/en-us/research/uploads/prod/2018/03/cloudviews-sigmod2018.pdf

If you are interested in looking into this using existing infra in Hive, I recommend looking
at Zoltan's recent work which tracks query plans + runtime statistics from the RUNTIME_STATS
table in the metastore.

You can step through what this does by running

"explain reoptimization <query>;"
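
For illustration, a session using that command might look roughly like the sketch below. It assumes a Hive 3.x build with query re-execution support. The property names are given as I understand them and should be checked against your version's HiveConf, and the tables and columns (sales, customers) are hypothetical.

```sql
-- Assumed Hive 3.x settings; verify the names against your version's HiveConf.
set hive.query.reexecution.enabled=true;
set hive.query.reexecution.strategies=overlay,reoptimize;
-- Keep runtime statistics beyond a single query; "metastore" is assumed
-- to be the scope that populates the RUNTIME_STATS table.
set hive.query.reexecution.stats.persist.scope=metastore;

-- Hypothetical tables and columns, for illustration only.
explain reoptimization
select c.region, sum(s.amount)
from sales s join customers c on s.customer_id = c.id
group by c.region;
```

Comparing this output with a plain "explain" of the same query should show where the recorded runtime statistics differ from the optimizer's estimates.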

Cheers,
Gopal


