hadoop-common-user mailing list archives

From Anthony Mattas <anth...@mattas.net>
Subject Re: Benchmarking Hive Changes
Date Wed, 05 Mar 2014 14:02:32 GMT
Yes, I'm using the HortonWorks Data Platform 2.0 Sandbox which is a
standalone box.

But shame on me: it looks like both files are very tiny (46K). I'm
seeing about 23 seconds per query, most of which appears to be MapReduce
(MR) job startup.

So I'm going to find a new data set and try again. Are there any
optimizations that can be done to reduce the startup time?
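
For reference, the comparison described in the quoted message can be set up
roughly as follows. This is a minimal sketch, not the exact commands used:
the *_orc table names are hypothetical, and the SET properties assume a Hive
release that supports vectorized execution and ORC predicate pushdown.

```sql
-- Copy the text-format sample tables into ORC (hypothetical *_orc names).
CREATE TABLE sample_07_orc STORED AS ORC AS SELECT * FROM sample_07;
CREATE TABLE sample_08_orc STORED AS ORC AS SELECT * FROM sample_08;

-- Enable vectorized execution (assumed to be available in this Hive version).
SET hive.vectorized.execution.enabled = true;
-- Enable predicate pushdown, and let the ORC reader use it to skip data.
SET hive.optimize.ppd = true;
SET hive.optimize.index.filter = true;

-- Same query as in the quoted message, pointed at the ORC copies.
SELECT s07.description, s07.salary, s08.salary,
       s08.salary - s07.salary
FROM sample_07_orc s07 JOIN sample_08_orc s08
  ON (s07.code = s08.code)
WHERE s07.salary < s08.salary
SORT BY s08.salary - s07.salary DESC;
```

With inputs of only a few tens of kilobytes, the fixed job startup time will
likely dominate, so any difference between formats may only show up on a
larger data set.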

Ultimately I'm trying to compare the response time in Hive versus an EDW
platform. Of course I still expect the EDW to perform better, but with the
advancements in the newer versions of Hive I'm hoping for at least a
reasonable response time for a user wishing to do interactive querying.
I'm specifically interested in Hive; I know you can get really good
performance out of Impala, but I'm not yet interested in going that route.

Anthony Mattas
anthony@mattas.net


On Wed, Mar 5, 2014 at 8:47 AM, java8964 <java8964@hotmail.com> wrote:

> Are you running on a single standalone box? How large are your test files,
> and how long did the jobs of each type take?
>
> Yong
>
> > From: anthony@mattas.net
> > Subject: Benchmarking Hive Changes
> > Date: Tue, 4 Mar 2014 21:31:42 -0500
> > To: user@hadoop.apache.org
>
> >
> > I've been trying to benchmark some of the Hive enhancements in Hadoop
> 2.0 using the HDP Sandbox.
> >
> > I took one of their example queries and executed it with the tables
> stored as TEXTFILE, RCFILE, and ORC. I also tried enabling vectorized
> execution and predicate pushdown.
> >
> > SELECT s07.description, s07.salary, s08.salary,
> > s08.salary - s07.salary
> > FROM
> > sample_07 s07 JOIN sample_08 s08
> > ON ( s07.code = s08.code)
> > WHERE
> > s07.salary < s08.salary
> > SORT BY s08.salary-s07.salary DESC
> >
> > Ultimately there was not much difference in performance between any of
> the executions. Can someone clarify whether I need an actual full cluster
> to see performance improvements, or whether I'm missing something else? I
> thought that at minimum I would have seen an improvement moving from
> TEXTFILE to ORC.
>
