hadoop-mapreduce-user mailing list archives

From Anthony Mattas <anth...@mattas.net>
Subject Re: Benchmarking Hive Changes
Date Wed, 05 Mar 2014 16:15:19 GMT
Hi Yong,

I'm confused - I'm using Hive 0.12.0, shouldn't that be using "Stinger" by
default? Or are there configurations that have to be enabled?

Anthony Mattas
anthony@mattas.net
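
On the configuration question above: Stinger is the name of an initiative
to speed up Hive rather than a single switch, so its features are turned on
individually per session. A minimal sketch, assuming a Hive CLI or Beeline
session. As far as I know the Tez engine only ships with Hive 0.13 and
later, so on 0.12 that setting is not expected to apply; local mode is
shown because it targets the job startup cost discussed below.

```sql
-- Show the current value of a property before changing it.
SET hive.execution.engine;

-- Assumption: Hive 0.13 or later with Tez installed. On Hive 0.12 the
-- only execution engine is MapReduce.
SET hive.execution.engine=tez;

-- Let Hive decide to run small jobs in local mode instead of submitting
-- them to the cluster, which avoids part of the startup cost on tiny inputs.
SET hive.exec.mode.local.auto=true;
```

Whether local mode actually triggers depends on input-size thresholds, so it
is worth checking the job output to see which mode a query ran in.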


On Wed, Mar 5, 2014 at 11:06 AM, java8964 <java8964@hotmail.com> wrote:

> Your files are too small for any meaningful test of these 3 file types.
>
> Most of the 23 seconds is spent on preparing/starting your MR job and
> shutting it down.
>
> You need at least gigabytes of data to compare the performance of these 3
> types and get any meaningful result.
>
> But as long as it is Hive on top of MapReduce, it will be really hard to
> achieve an "interactive" result. MapReduce is batch mode, period.
>
> You do want to consider Impala/Spark or Apache Stinger if you are really
> looking for "interactive".
>
> Yong
>
> ------------------------------
> Date: Wed, 5 Mar 2014 09:02:32 -0500
> Subject: Re: Benchmarking Hive Changes
> From: anthony@mattas.net
> To: user@hadoop.apache.org
>
>
> Yes, I'm using the Hortonworks Data Platform 2.0 Sandbox, which is a
> standalone box.
>
> But shame on me: it looks like both files are very tiny (46K). I'm
> seeing about 23 seconds per query, which appears to be mostly MR startup
> time.
>
> So I'm going to find a new data set and try again. Are there any
> optimizations that can be done to reduce the startup time?
>
> Ultimately I'm trying to compare the response time in Hive versus an EDW
> platform. Of course I still expect the EDW to perform better, but with the
> advancements in the newer versions of Hive I'm hoping for at least a
> reasonable response for a user wishing to do interactive querying.
> I specifically want to use Hive; I know you can get really good
> performance out of Impala, but I am not yet interested in going that route.
>
> Anthony Mattas
> anthony@mattas.net
>
>
> On Wed, Mar 5, 2014 at 8:47 AM, java8964 <java8964@hotmail.com> wrote:
>
> Are you running this on a single standalone box? How large are your test
> files, and how long did the jobs of each type take?
>
> Yong
>
> > From: anthony@mattas.net
> > Subject: Benchmarking Hive Changes
> > Date: Tue, 4 Mar 2014 21:31:42 -0500
> > To: user@hadoop.apache.org
>
> >
> > I've been trying to benchmark some of the Hive enhancements in Hadoop
> > 2.0 using the HDP Sandbox.
> >
> > I took one of their example queries and executed it with the tables
> > stored as TEXTFILE, RCFILE, and ORC. I also tried enabling vectorized
> > execution and predicate pushdown.
> >
> > SELECT s07.description, s07.salary, s08.salary,
> > s08.salary - s07.salary
> > FROM
> > sample_07 s07 JOIN sample_08 s08
> > ON ( s07.code = s08.code)
> > WHERE
> > s07.salary < s08.salary
> > SORT BY s08.salary-s07.salary DESC
> >
> > Ultimately there was not much difference in performance between any of
> > the executions. Can someone clarify whether I need an actual full
> > cluster to see performance improvements, or whether I'm missing
> > something else? I thought at minimum I would have seen an improvement
> > moving to ORC from TEXTFILE.
>
>
>
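
The experiment described in the original message could be set up roughly as
follows. This is a sketch under stated assumptions, not the poster's actual
script: the table names sample_07_orc and sample_08_orc are hypothetical,
and vectorized execution is, to my knowledge, a Hive 0.13 feature that works
on ORC input, so whether it takes effect on a given 0.12 build should be
checked.

```sql
-- Copy the text tables into ORC (target table names are hypothetical).
CREATE TABLE sample_07_orc STORED AS ORC AS SELECT * FROM sample_07;
CREATE TABLE sample_08_orc STORED AS ORC AS SELECT * FROM sample_08;

-- Predicate pushdown in the optimizer, plus pushing filters to the ORC reader.
SET hive.optimize.ppd=true;
SET hive.optimize.index.filter=true;

-- Vectorized execution (assumption: Hive 0.13 or later, ORC input).
SET hive.vectorized.execution.enabled=true;

-- The query from the thread, run against the ORC copies.
SELECT s07.description, s07.salary, s08.salary,
       s08.salary - s07.salary
FROM sample_07_orc s07 JOIN sample_08_orc s08
  ON (s07.code = s08.code)
WHERE s07.salary < s08.salary
SORT BY s08.salary - s07.salary DESC;
```

Note that the only filter here compares columns from two different tables,
so it cannot be applied while scanning either table on its own. Together
with the tiny input size, that leaves predicate pushdown little to improve
for this particular query.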
