Date: Thu, 3 Feb 2011 15:49:58 -0800
Subject: Re: Hive queries consuming 100% cpu
From: Vijay
To: user@hive.apache.org
Cc: dev@hive.apache.org

Sorry, I should have given more details. The query was limited to a
partition range; I just omitted the WHERE clause in the mail. The table
is not that big. For each day, there is one gzipped file. The largest
file is about 250MB (close to 2GB uncompressed). I did intend to count;
that was just a test, since I wanted to run a query that did the most
minimal logic/processing.

Here's a test I ran now. The query is getting count(1) for 8 days. It
spawned 8 maps as expected. The maps ran for anywhere between 42 and 69
seconds (which may or may not be right; I need to check that). It
spawned only one reduce task. The reducer ran for 117 seconds, which
seems long for this query.

On Thu, Feb 3, 2011 at 2:31 PM, Viral Bajaria wrote:
> Hey Vijay,
>
> You can go to the mapred UI, which normally runs on port 50030 of the
> namenode, and see how many map jobs got created for your submitted query.
>
> You said that the events table has daily partitions, but the example query
> that you have does not prune the partitions by specifying a WHERE clause.
> So I have the following questions:
>
> 1) how big is the table (you can just do a hadoop dfs -dus
> ? how many partitions ?
> 2) do you really intend to count the number of events across all days?
> 3) could you build a query which computes over 1-5 day(s) and persists the
> data in a separate table for consumption later on?
>
> Based on your node configuration, I am just guessing the amount of data to
> process is too large and hence the high CPU.
>
> Thanks,
> Viral
>
> On Thu, Feb 3, 2011 at 12:49 PM, Vijay wrote:
>>
>> Hi,
>>
>> The simplest of hive queries seem to be consuming 100% cpu. This is
>> with a small 4-node cluster. The machines are pretty beefy (16 cores
>> per machine, tons of RAM, 16 M+R maximum tasks configured, 1GB RAM for
>> mapred.child.java.opts, etc). An example is a simple query like
>> "select count(1) from events", where the events table has daily
>> partitions of log files in gzipped format. While this is probably too
>> generic a question and there is a bunch of investigation we need to do,
>> are there any specific areas for me to look at? Has anyone seen
>> anything like this before? Also, are there any tools or easy options
>> to profile hive query execution?
>>
>> Thanks in advance,
>> Vijay
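
As a rough sanity check on the map timings reported in the reply at the top of this message (42 to 69 seconds per map, largest file about 250MB gzipped and close to 2GB uncompressed), the per-task read rate can be estimated with a few lines of Python. This is only a sketch: it assumes each timed map read a file about as large as the largest one, which the thread does not state, and the names `throughput_mb_per_s` and `summarize` are made up for illustration.

```python
# Back-of-the-envelope throughput for a single map task, using the figures
# quoted in the reply. Assumption (not stated in the thread): both the
# fastest and the slowest map read a file about as large as the largest one.

GZIPPED_MB = 250          # largest daily file, compressed
UNCOMPRESSED_MB = 2048    # "close to 2GB" uncompressed, treated as an upper bound
MAP_SECONDS = (42, 69)    # fastest and slowest reported map durations


def throughput_mb_per_s(size_mb, seconds):
    """Average rate in MB/s for a task that processed size_mb in `seconds`."""
    return size_mb / seconds


def summarize():
    """Return one row per reported duration with compressed and uncompressed rates."""
    rows = []
    for seconds in MAP_SECONDS:
        rows.append({
            "seconds": seconds,
            "gzipped_mb_s": round(throughput_mb_per_s(GZIPPED_MB, seconds), 1),
            "uncompressed_mb_s": round(throughput_mb_per_s(UNCOMPRESSED_MB, seconds), 1),
        })
    return rows


if __name__ == "__main__":
    for row in summarize():
        print(row)
```

If the files behind those timings were smaller than the largest one, the real rates would be lower than these estimates. One way to put the numbers in context would be to time decompressing one of the daily files outside Hadoop (for example with `gzip -dc` piped to `wc -l`) and compare.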