From Sergey Shelukhin <>
Subject Re: Hive Start Up Time Manifolds Greater than Execution Time
Date Fri, 18 Sep 2015 18:48:50 GMT
Actually, on second thought, even listing directories (which is necessary to
launch the job) could take a long time.
If there are any client logs, you can try to take a look to see where the
time is spent.
If you are running under Hive CLI, the logs would be in
/tmp/$USER/hive.log by default.
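
One way to see where the time goes is to look for the largest pauses between consecutive log lines. The following is a minimal sketch, not a Hive tool: it assumes each log line starts with a timestamp of the form "2015-09-18 11:46:00,123" (adjust the pattern if your log4j layout differs), and the function names are made up for illustration.

```python
import re
from datetime import datetime

# Assumption: log lines begin with a timestamp like "2015-09-18 11:46:00,123".
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),(\d{3})")


def parse_ts(line):
    """Return the leading timestamp of a log line, or None if there is none."""
    m = TS_RE.match(line)
    if m is None:
        return None
    base = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
    return base.replace(microsecond=int(m.group(2)) * 1000)


def largest_gaps(lines, top=5):
    """Return up to `top` (seconds, line) pairs, largest pause first.

    `line` is the timestamped line logged just before the pause, i.e. the
    last thing the client reported before going quiet.
    """
    gaps = []
    prev_ts, prev_line = None, None
    for line in lines:
        ts = parse_ts(line)
        if ts is None:
            continue  # skip continuation lines such as stack traces
        if prev_ts is not None:
            gaps.append(((ts - prev_ts).total_seconds(), prev_line.rstrip("\n")))
        prev_ts, prev_line = ts, line
    gaps.sort(key=lambda g: g[0], reverse=True)
    return gaps[:top]


def report(path, top=5):
    """Print the largest pauses found in the log file at `path`."""
    with open(path, errors="replace") as f:
        for seconds, line in largest_gaps(f, top):
            print(f"{seconds:10.1f}s after: {line[:160]}")
```

For example, report("/tmp/<your user>/hive.log") prints the five longest pauses. A long pause after a line about listing files or fetching partitions would point at the file system or the metastore respectively.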

On 15/9/18, 11:46, "Sergey Shelukhin" <> wrote:

>Which version of Hive, and which file format, are you using?
>It could be either reading file footers for ORC - in recent versions
>there’s a way to disable that (set hive.exec.orc.split.strategy=BI); or
>some similar feature for other formats that I’m not immediately familiar
>with.
>It could also be slow metastore calls.
>From: Sreenath <<>>
>Reply-To: "<>"
>Date: Friday, September 18, 2015 at 02:24
>To: "<>"
>Subject: Hive Start Up Time Manifolds Greater than Execution Time
>Hi All,
>Something interesting came to my notice yesterday when I was using Hive
>for some queries. The time taken by Hive to launch a MapReduce job was
>many times higher than the time taken by Hadoop to actually execute it.
>These are the details of the table on which the query is being run.
>    user_id string,
>    stage string,
>    url string
>PARTITIONED BY (dt string , id string)
>All the data for the table is stored in S3, and each day there are around
>2000 unique ids, i.e. 2000 partitions are added daily. We can assume
>that each partition has on average 100 MB of gzip-compressed data.
>Now when I run a query like "SELECT DISTINCT user_id FROM A WHERE
>dt>='20150101' and dt <= '20150401'", i.e. over a period of 3 months and
>approx 60000 partitions, it takes Hive approximately 2 hrs to launch the
>MapReduce job, and the launched job finishes in just 20 min. So I was
>wondering if someone can help me understand what Hive is doing in these 2 hrs?
>Would really appreciate some help here. Thanks in advance!

