hive-user mailing list archives

From Keith Wiley <kwi...@keithwiley.com>
Subject Re: Want query to use more reducers
Date Mon, 30 Sep 2013 20:40:56 GMT
Thanks.  mapred.reduce.tasks and hive.exec.reducers.max seem to have fixed the problem.  It
is now saturating the cluster and running the query super fast.  Excellent!
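
For anyone who finds this thread later, the settings look roughly like the sketch below when issued in a Hive session before the query. The numeric values are placeholders assumed for illustration (40 stands in for the cluster's reduce-slot count, and 64 MB is an arbitrary smaller bytes-per-reducer figure); none of them come from this thread. Pick option 1 or option 2, since a hard-coded mapred.reduce.tasks makes Hive always use that number instead of the size-based guess.

```sql
-- Option 1: keep the size-based guess, but make it produce more reducers.
-- 64 MB per reducer instead of the 1G default (assumed value).
SET hive.exec.reducers.bytes.per.reducer=67108864;
-- Cap the guess at the cluster's reduce-slot count (40 is a placeholder).
SET hive.exec.reducers.max=40;

-- Option 2: hard-code the reducer count (the default of -1 means "guess").
-- Uncomment to use it in place of option 1.
-- SET mapred.reduce.tasks=40;

-- Optional: merge small output files at the end of map-reduce jobs,
-- one of the "merge" options in the configuration property list.
SET hive.merge.mapredfiles=true;
```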

On Sep 30, 2013, at 12:28 , Sean Busbey wrote:

> Hey Keith,
> 
> It sounds like you should tweak the settings for how Hive handles query execution[1]:
> 
> 1) Tune the guessed number of reducers based on input size
> 
> = hive.exec.reducers.bytes.per.reducer
> 
> Defaults to 1G. Based on your description, it sounds like this is probably still at default.
> 
> In this case, you should also set a max # of reducers based on your cluster size.
> 
> = hive.exec.reducers.max
> 
> I usually set this to the # reduce slots, if there's a decent chance I'll get to saturate
> the cluster. If not, don't worry about it.
> 
> 2) Hard code a number of reducers
> 
> = mapred.reduce.tasks
> 
> Setting this will cause Hive to always use that number. It defaults to -1, which tells
> Hive to use the heuristic about input size to guess.
> 
> In either of the above cases, you should look at the options to merge small files (search
> for "merge" in the configuration property list) to avoid getting lots of little outputs.
> 
> HTH
> 
> [1]: https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-QueryExecution
> 
> -Sean
> 
> On Mon, Sep 30, 2013 at 11:31 AM, Keith Wiley <kwiley@keithwiley.com> wrote:
> I have a query that doesn't use reducers as efficiently as I would hope.  If I run it
> on a large table, it uses more reducers, even saturating the cluster, as I desire.  However,
> on smaller tables it uses as few as a single reducer.  While I understand there is a logic
> in this (not using multiple reducers until the data size is larger), it is nevertheless inefficient
> to run a query for thirty minutes leaving the entire cluster vacant when the query could distribute
> the work evenly and wrap things up in a fraction of the time.  The query is shown below (abstracted
> to its basic form).  As you can see, it is a little atypical: it is a nested query which obviously
> implies two map-reduce jobs and it uses a script for the reducer stage that I am trying to
> speed up.  I thought the "distribute by" clause should make it use the reducers more evenly,
> but as I said, that is not the behavior I am seeing.
> 
> Any ideas how I could improve this situation?
> 
> Thanks.
> 
> CREATE TABLE output_table ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' as
> SELECT * FROM (
>         FROM (
>                 SELECT * FROM input_table
>                 DISTRIBUTE BY input_column_1 SORT BY input_column_1 ASC, input_column_2 ASC, input_column_etc ASC) q
>         SELECT TRANSFORM(*)
>         USING 'python my_reducer_script.py' AS(
>         output_column_1,
>         output_column_2,
>         output_column_etc
>         )
> ) s
> ORDER BY output_column_1;
> 
> ________________________________________________________________________________
> Keith Wiley     kwiley@keithwiley.com     keithwiley.com    music.keithwiley.com
> 
> "Luminous beings are we, not this crude matter."
>                                            --  Yoda
> ________________________________________________________________________________
> 
> 
> 
> 
> -- 
> Sean


________________________________________________________________________________
Keith Wiley     kwiley@keithwiley.com     keithwiley.com    music.keithwiley.com

"I do not feel obliged to believe that the same God who has endowed us with
sense, reason, and intellect has intended us to forgo their use."
                                           --  Galileo Galilei
________________________________________________________________________________

