hive-dev mailing list archives

From Sergey Shelukhin <ser...@hortonworks.com>
Subject Re: Hive Start Up Time Manifolds Greater than Execution Time
Date Fri, 18 Sep 2015 18:48:50 GMT
Actually, on second thought, even listing directories (which is necessary to
launch the job) could take a long time.
If there are any client logs, you can try to take a look to see where the
time is spent.
If you are running under Hive CLI, the logs would be in
/tmp/$USER/hive.log by default.
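
A rough way to find where the time goes is to look for the largest pauses
between consecutive timestamped log lines. The Python sketch below does that.
It assumes each log line begins with a log4j ISO8601-style timestamp such as
2015-09-18 18:48:50,123; that is an assumption about the logging pattern, so
adjust the regex if yours differs. The function name largest_gaps is made up
for illustration.

```python
import re
from datetime import datetime

# Assumed timestamp prefix (log4j ISO8601 style), e.g. "2015-09-18 18:48:50,123".
# This is an assumption about the log pattern, not something Hive guarantees.
TS = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),(\d{3})")


def largest_gaps(lines, top=5):
    """Return the `top` largest pauses as (seconds, line_before, line_after)."""
    gaps = []
    prev_time = prev_line = None
    for line in lines:
        m = TS.match(line)
        if not m:
            continue  # stack traces and continuation lines carry no timestamp
        t = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
        t = t.replace(microsecond=int(m.group(2)) * 1000)
        if prev_time is not None:
            gap = (t - prev_time).total_seconds()
            gaps.append((gap, prev_line, line.rstrip()))
        prev_time, prev_line = t, line.rstrip()
    # Largest pause first
    return sorted(gaps, key=lambda g: g[0], reverse=True)[:top]
```

To use it, open the log file and pass the file object to largest_gaps; the
line just before each large gap usually hints at what the client was doing
during that time.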

On 15/9/18, 11:46, "Sergey Shelukhin" <sergey@hortonworks.com> wrote:

>Which version of Hive and which file format are you using?
>It could be reading file footers for ORC; in recent versions
>there’s a way to disable that (set hive.exec.orc.split.strategy=BI). Or it
>could be some similar feature for other formats that I’m not immediately
>familiar with.
>It could also be slow metastore calls.
>
>From: Sreenath <sreenaths1923@gmail.com>
>Reply-To: "user@hive.apache.org" <user@hive.apache.org>
>Date: Friday, September 18, 2015 at 02:24
>To: "dev@hive.apache.org" <dev@hive.apache.org>,
>"user@hive.apache.org" <user@hive.apache.org>
>Subject: Hive Start Up Time Manifolds Greater than Execution Time
>
>Hi All,
>
>I noticed something interesting yesterday while running some queries in
>Hive. The time Hive took to launch a MapReduce job was many times higher
>than the time Hadoop took to actually execute it.
>These are the details of the table the query runs against.
>
>CREATE EXTERNAL TABLE A
>(
>    user_id string,
>    stage string,
>    url string
>)
>PARTITIONED BY (dt string , id string)
>
>All the data for the table is stored in S3, and each day there are around
>2000 unique ids, i.e. 2000 partitions are added daily. We can assume
>that each partition holds on average 100 MB of gzip-compressed data.
>Now when I run a query like "SELECT DISTINCT user_id FROM A WHERE
>dt >= '20150101' AND dt <= '20150401'", i.e. over a period of 3 months and
>approximately 60000 partitions, Hive takes approximately 2 hours to launch
>the MapReduce job, and the launched job finishes in just 20 minutes. So I
>was wondering if someone can help me understand what Hive is doing during
>these 2 hours.
>Would really appreciate some help here. Thanks in advance!
>
>
>Best,
>Sreenath
>
