From: Anthony Mattas
To: user@hadoop.apache.org
Subject: Re: Benchmarking Hive Changes
Date: Wed, 5 Mar 2014 09:02:32 -0500
References: <4272CEF8-ADB6-4AB4-908D-57D94158B025@mattas.net>
Mailing-List: contact user-help@hadoop.apache.org; run by ezmlm
Reply-To: user@hadoop.apache.org
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary=089e011835dad9d1ed04f3dc7884

--089e011835dad9d1ed04f3dc7884
Content-Type: text/plain; charset=ISO-8859-1

Yes, I'm using the Hortonworks Data Platform 2.0 Sandbox, which is a standalone box.

But shame on me: it looks like both files are very tiny (46K). I'm seeing about 23 seconds per query, which appears to be mostly MR startup.

So I'm going to find a new data set and try again. Are there any optimizations that can reduce the startup time?

Ultimately I'm trying to compare response time in Hive against an EDW platform. Of course I still expect the EDW to perform better, but with the advancements in the newer versions of Hive I'm hoping for at least a reasonable response time for a user doing interactive querying. I specifically want to use Hive; I know you can get really good performance out of Impala, but I'm not yet interested in going that route.

Anthony Mattas
anthony@mattas.net

On Wed, Mar 5, 2014 at 8:47 AM, java8964 wrote:

> Are you running on a single standalone box? How large are your test files,
> and how long did the jobs of each type take?
>
> Yong
>
> > From: anthony@mattas.net
> > Subject: Benchmarking Hive Changes
> > Date: Tue, 4 Mar 2014 21:31:42 -0500
> > To: user@hadoop.apache.org
> >
> > I've been trying to benchmark some of the Hive enhancements in Hadoop
> > 2.0 using the HDP Sandbox.
> >
> > I took one of their example queries and executed it with the tables
> > stored as TEXTFILE, RCFILE, and ORC. I also tried enabling
> > vectorized execution and predicate pushdown.
> >
> > SELECT s07.description, s07.salary, s08.salary,
> >   s08.salary - s07.salary
> > FROM
> >   sample_07 s07 JOIN sample_08 s08
> > ON ( s07.code = s08.code)
> > WHERE
> >   s07.salary < s08.salary
> > SORT BY s08.salary-s07.salary DESC
> >
> > Ultimately there was not much difference in performance across the
> > executions. Can someone clarify whether I need an actual full cluster to
> > see performance improvements, or whether I'm missing something else? I
> > thought at minimum I would have seen an improvement moving to ORC from
> > TEXTFILE.

--089e011835dad9d1ed04f3dc7884
Content-Type: text/html; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
--089e011835dad9d1ed04f3dc7884--