hive-dev mailing list archives

From Lars Francke
Subject Question about Utilities#getInputPaths
Date Wed, 05 Oct 2016 13:21:12 GMT
Hi everyone,

I've encountered a performance issue at multiple customers now. The problem
is the processing of input paths when there are lots of partitions.

We check each directory to see whether it is empty. This alone can take minutes.
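
To illustrate the access pattern I mean, here is a minimal sketch in plain
Java. It is not the Hive code: it uses java.nio on the local file system as a
stand-in for Hadoop's FileSystem API, and the names EmptyDirScan, isEmpty and
findEmpty are made up for this example. The point is only that there is one
listing call per partition directory, done one after the other; on HDFS each
such listing is a round trip to the NameNode.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration, not Hive code: local directories stand in for
// partition directories on HDFS.
public class EmptyDirScan {

    // True if the directory has no entries. Costs one listing call.
    static boolean isEmpty(Path dir) throws IOException {
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
            return !entries.iterator().hasNext();
        }
    }

    // Checks every partition directory sequentially and returns the empty ones.
    static List<Path> findEmpty(List<Path> partitionDirs) throws IOException {
        List<Path> empty = new ArrayList<>();
        for (Path dir : partitionDirs) {
            if (isEmpty(dir)) {
                empty.add(dir);
            }
        }
        return empty;
    }

    public static void main(String[] args) throws IOException {
        // Build a fake table with 1000 partitions, every second one empty.
        Path table = Files.createTempDirectory("table");
        List<Path> partitions = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            Path p = Files.createDirectory(table.resolve("part=" + i));
            if (i % 2 == 0) {
                Files.createFile(p.resolve("data"));
            }
            partitions.add(p);
        }

        long start = System.nanoTime();
        List<Path> empty = findEmpty(partitions);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(empty.size() + " empty of " + partitions.size()
                + " partitions, checked in " + elapsedMs + " ms");
    }
}
```

Locally this finishes quickly, of course. The cost I am seeing comes from
doing the same thing against a remote file system with many thousands of
partitions.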

There is a comment in Utilities:

"We need to add a empty file, it is not acceptable to change the operator
Consider the query:
select * from (select count(1) from T union all select count(1) from T2) x;

If T is empty and T2 contains 100 rows, the user expects: 0, 100 (2 rows)"

I have to admit that I don't quite understand that. Would it mean that we'd
only get a single row if we left out this empty path?
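
To make the question concrete, this is how I currently read the comment, as a
toy model in plain Java. It is not Hive code, and the central assumption (a
branch with no input path never runs, so its count emits no row) is exactly
the part I am unsure about. UnionCountModel, countBranch and unionAll are
names invented for this sketch.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Toy model of: select count(1) from T union all select count(1) from T2
public class UnionCountModel {

    // Models one branch. Each inner list is one input file with its rows.
    // ASSUMPTION (unconfirmed): with no input file at all the branch is never
    // run and therefore emits no row; with at least one file, even an empty
    // one, the global count emits exactly one row.
    static List<Long> countBranch(List<List<String>> inputFiles) {
        List<Long> rows = new ArrayList<>();
        if (inputFiles.isEmpty()) {
            return rows;
        }
        long count = 0;
        for (List<String> file : inputFiles) {
            count += file.size();
        }
        rows.add(count);
        return rows;
    }

    // union all simply concatenates the rows of both branches.
    static List<Long> unionAll(List<Long> left, List<Long> right) {
        List<Long> out = new ArrayList<>(left);
        out.addAll(right);
        return out;
    }

    public static void main(String[] args) {
        List<String> emptyFile = Collections.emptyList();
        List<List<String>> t2 =
                Collections.singletonList(Collections.nCopies(100, "row"));

        // T is empty but represented by a dummy empty file.
        List<List<String>> tWithDummy = Collections.singletonList(emptyFile);
        // T is empty and its path is left out entirely.
        List<List<String>> tLeftOut = Collections.emptyList();

        System.out.println(unionAll(countBranch(tWithDummy), countBranch(t2)));
        System.out.println(unionAll(countBranch(tLeftOut), countBranch(t2)));
    }
}
```

Under that assumption the first line printed is [0, 100] and the second is
[100], which would be the single row I am asking about.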

I do not understand the internals of query planning and execution well
enough, but if someone has time to explain this to me I'd be very grateful.
(If someone who understands all of this is based in Europe, I'd be more than
happy to jump on a call as well, or to invite you to a steak & beer ;-) )

If that is indeed the reason: this code was written 3 or 4 years ago. Maybe
the internals have changed enough that we can now deal with this differently?
For simple queries like SELECT * FROM T LIMIT 10 I'm seeing runtimes of 5-10
minutes just because of this overhead.
