hive-user mailing list archives

From shule ney <neysh...@gmail.com>
Subject Re: Reduce the number of map/reduce jobs during join
Date Tue, 13 Mar 2012 17:15:07 GMT
Do the joins share the same key?
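
For reference, Hive can typically merge several joins into a single map/reduce
job only when they all use the same join key; joins on different keys are each
planned as a separate job. A minimal sketch with hypothetical tables t1, t2
and t3:

  -- Same key from the driving table in both joins: Hive can usually
  -- merge these into one map/reduce job.
  SELECT t1.val, t2.val, t3.val
  FROM t1
    JOIN t2 ON t1.key = t2.key
    JOIN t3 ON t1.key = t3.key;

  -- Different keys from the driving table: each join is planned as its
  -- own map/reduce job (two jobs here).
  SELECT t1.val, t2.val, t3.val
  FROM t1
    JOIN t2 ON t1.key1 = t2.key
    JOIN t3 ON t1.key2 = t3.key;

In the query quoted below, each LEFT OUTER JOIN uses a different column of
prc_idap_pi_root as the join key, which is why one job is launched per join.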

2012/3/13 Bruce Bian <weidong.ban@gmail.com>

> Yes, it's in my hive-default.xml, and Hive figured to use only one reducer,
> so I thought increasing it to 5 might help, which it doesn't.
> Anyway, scanning the largest table 6 times isn't efficient, hence my
> question.
>
>
> On Wed, Mar 14, 2012 at 12:37 AM, Jagat <jagatsingh@gmail.com> wrote:
> >
> > Hello Weidong Bian
> >
> > Did you see the following configuration properties in conf directory
> >
> >
> > <property>
> >   <name>mapred.reduce.tasks</name>
> >   <value>-1</value>
> >   <description>The default number of reduce tasks per job. Typically set
> >     to a prime close to the number of available hosts. Ignored when
> >     mapred.job.tracker is "local". Hadoop sets this to 1 by default,
> >     whereas Hive uses -1 as its default value.
> >     By setting this property to -1, Hive will automatically figure out
> >     what the number of reducers should be.
> >   </description>
> > </property>
> >
> >
> > <property>
> >   <name>hive.exec.reducers.max</name>
> >   <value>999</value>
> >   <description>The maximum number of reducers that will be used. If the
> >     value specified in the configuration parameter mapred.reduce.tasks is
> >     negative, Hive will use this as the maximum number of reducers when
> >     automatically determining the number of reducers.</description>
> > </property>
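
As a rough sketch of how these interact: when mapred.reduce.tasks is left at
-1, Hive estimates the reducer count from the input size and caps it at
hive.exec.reducers.max; the per-reducer input threshold is controlled by
hive.exec.reducers.bytes.per.reducer. The values below are only examples:

  -- let Hive estimate the reducer count, but allow at most 32 reducers
  set mapred.reduce.tasks=-1;
  set hive.exec.reducers.max=32;
  -- aim for roughly one reducer per 256MB of input (example value)
  set hive.exec.reducers.bytes.per.reducer=256000000;
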
> >
> > Thanks and Regards
> >
> > Jagat
> >
> >
> > On Tue, Mar 13, 2012 at 9:54 PM, Bruce Bian <weidong.ban@gmail.com> wrote:
> >>
> >> Hi there,
> >> when I'm using Hive to run a query as follows, 6 Map/Reduce jobs are
> >> launched, one for each join, and it deals with ~460MB of data in ~950
> >> seconds, which I think is way too slow for a cluster with 5 slaves and
> >> 24GB memory/12 disks each.
> >> set mapred.reduce.tasks=5;
> >> SELECT a.*, e.code_name as is_internet_flg, f.code_name as wb_access_tp_desc,
> >>        g.code_name as free_tp_desc,
> >>        b.acnt_no, b.addr_id, b.postcode, b.acnt_rmnd_tp, b.print_tp, b.media_type,
> >>        c.cust_code, c.root_cust_code,
> >>        d.mdf_name, d.sub_bureau_code, d.bureau_cd, d.adm_sub_bureau_name, d.bureau_name
> >> FROM prc_idap_pi_root a
> >>   LEFT OUTER JOIN idap_pi_root_acnt b ON a.acnt_id=b.acnt_id
> >>   LEFT OUTER JOIN idap_pi_root_cust c ON a.cust_id=c.cust_id
> >>   LEFT OUTER JOIN ocrm_vt_area d ON a.dev_area_id=d.area_id
> >>   LEFT OUTER JOIN osor_code e ON a.data_internet_flg=e.code_val and e.code_tp='IS_INTERNET_FLG'
> >>   LEFT OUTER JOIN osor_code f ON a.wb_access_tp=f.code_val and f.code_tp='WEB_ACCESS_TP'
> >>   LEFT OUTER JOIN osor_code g ON a.free_tp=g.code_val and g.code_tp='FREE_TP';
> >> For each job, most of the time is consumed by the reduce phase. As the
> >> idap_pi_root table is very large, scanning over it 6 times is quite
> >> inefficient. Is it possible to reduce the number of map/reduce jobs to
> >> only one?
> >> Thanks,
> >> Weidong Bian
>
>
