Date: Fri, 18 Jul 2014 10:26:14 -0400
Subject: Re: Hive huge 'startup time'
From: Edward Capriolo
To: "user@hive.apache.org"

The planning phase needs to do work for every Hive partition and every
Hadoop file. If you have a lot of 'small' files or many partitions, this can
take a long time.
Also, the planning phase that happens on the job tracker is single-threaded.
Also, the new YARN stuff requires back-and-forth to allocate containers.
Sometimes raising the heap for the hive-cli/launching process helps, because
the default heap of 1 GB may not be a lot of space to deal with all of the
partition information, and the extra memory headroom can make this go faster.
Sometimes setting the min split size higher launches fewer map tasks, which
speeds up everything.

So the answer... Try to tune everything. Start Hive like this:

bin/hive -hiveconf hive.root.logger=DEBUG,console

And record where the longest stretches with no output are; that is what you
should try to tune first.

On Fri, Jul 18, 2014 at 9:36 AM, diogo wrote:

> This is probably a simple question, but I'm noticing that for queries that
> run on 1+TB of data, it can take Hive up to 30 minutes to actually start
> the first map-reduce stage. What is it doing? I imagine it's gathering
> information about the data somehow, this 'startup' time is clearly a
> function of the amount of data I'm trying to process.
>
> Cheers,
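
If eyeballing the DEBUG output for the longest quiet stretches gets tedious, something like the following might help. It is only a rough sketch, not part of Hive: it assumes the console output was saved to a file and that each log line starts with a timestamp in yy/MM/dd HH:mm:ss form. That format is an assumption, so adjust TS_PATTERN and TS_FORMAT to whatever your log4j pattern actually prints. The function name longest_gaps is made up for this example.

```python
import re
import sys
from datetime import datetime

# ASSUMPTION: each log line begins with a "yy/MM/dd HH:mm:ss" timestamp.
# Change both constants if your log4j layout prints something else.
TS_PATTERN = re.compile(r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2})")
TS_FORMAT = "%y/%m/%d %H:%M:%S"


def longest_gaps(lines, top=3):
    """Return up to `top` (gap_seconds, line_before, line_after) tuples,
    largest gap first. Lines without a leading timestamp are skipped."""
    stamped = []
    for line in lines:
        match = TS_PATTERN.match(line)
        if match:
            ts = datetime.strptime(match.group(1), TS_FORMAT)
            stamped.append((ts, line.rstrip("\n")))

    # Pair each timestamped line with the next one and measure the silence.
    gaps = [
        ((cur_ts - prev_ts).total_seconds(), prev_line, cur_line)
        for (prev_ts, prev_line), (cur_ts, cur_line) in zip(stamped, stamped[1:])
    ]
    gaps.sort(key=lambda gap: gap[0], reverse=True)
    return gaps[:top]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python this_script.py <saved-hive-console-output>
    with open(sys.argv[1]) as log_file:
        for seconds, before, after in longest_gaps(log_file):
            print(f"{seconds:.0f}s of silence")
            print(f"  before: {before}")
            print(f"  after:  {after}")
```

To use it, run the bin/hive command above with both stdout and stderr redirected to a file, then pass that file to the script. The lines printed on either side of the biggest gap show which step was running while nothing was logged, and that is the step to tune first.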