hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From fab wol <darkwoll...@gmail.com>
Subject Re: Best way to avoid cross join
Date Wed, 05 Mar 2014 14:27:48 GMT
hey Yong,

Even without the group by (pure cross join) the query is only using one
reducer. Even specifying more reducers doesn't help:

set mapred.reduce.tasks=50;
SELECT id1,
       m.keyword,
       prep_kw.keyword
FROM (select id1, keyword from import1) m
CROSS JOIN
  (SELECT keyword FROM et_keywords) prep_kw;

...

Hadoop job information for Stage-1: number of mappers: 3; number of
reducers: 1

What could be setup wrong here? Or can it be avoided to use this ugly cross
join at all? I mean my original problem is actually something else ;-)

Cheers
Wolli


2014-03-05 15:07 GMT+01:00 java8964 <java8964@hotmail.com>:

> Hi, Wolli:
>
> Cross join doesn't mean Hive has to use one reduce.
>
> From query point of view, the following cases will use one reducer:
>
> 1) Order by in your query (Instead of using sort by)
> 2) Only one reducer group, which means all the data have to send to one
> reducer, as there is only one reducer group.
>
> In your case, distinct count of id1 will be the reducer group count. Did
> you explicitly set the reducer count in your hive session?
>
> Yong
>
> ------------------------------
> Date: Wed, 5 Mar 2014 14:17:24 +0100
> Subject: Best way to avoid cross join
> From: darkwolli32@gmail.com
> To: user@hive.apache.org
>
> Hey everyone,
>
> before i write a lot of text, i just post something which is already
> written:
> http://www.sqlservercentral.com/Forums/Topic1328496-360-1.aspx
>
> The first posts adresses a pretty similar problem i also have. Currently
> my implementation looks like this:
>
> SELECT id1,
>   MAX(
>   CASE
>     WHEN m.keyword IS NULL
>     THEN 0
>     WHEN instr(m.keyword, prep_kw.keyword) > 0
>     THEN 1
>     ELSE 0
>   END) AS flag
> FROM (select id1, keyword from import1) m
> CROSS JOIN
>   (SELECT keyword FROM et_keywords) prep_kw
> GROUP BY id1;
>
> Since there is a cross join involved, the execution gets pinned down to 1
> reducer only and it takes ages to complete.
>
> The thread i posted is solving this with some special SQLserver tactics.
> But I was wondering if anybody has encountered the problem in Hive already
> and found a better way to solve this.
>
> I'm using Hive 0.11 on a MapR Distribution, if this is somehow important.
>
> Cheers
> Wolli
>

Mime
View raw message