hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Gopal Vijayaraghavan <>
Subject Re: Cross join/cartesian product explanation
Date Wed, 11 Nov 2015 01:17:55 GMT

>I¹m having trouble doing a cross join between two tables that are too big
>for a map-side join.

The actual query would help btw. Usually what is planned as a cross-join
can be optimized out into a binning query with a custom UDF.

In particular with 2-D geo queries with binning, which people tend to run
as cross-joins when they port PostGIS queries into Hive (ST_Contains).

>Trying to break down one table into small enough partitions and then
>unioning them together seems to give comparable performance to a cross
>Short of moving to a different execution engine, are there any
>performance improvements that can be made to lessen the pain of a cross

No, with MapReduce you're mostly stuck running each part of the union one
after the other.

Since this is a simple fan-out, you can try the simple parallelization

set hive.exec.parallel=true;

Beware, this has known deadlocks as queries get more complex - for which
you need a real Acyclic Graph scheduler engine like Tez.

Drop me an off-list mail if you want to run Tez on recent CDH, EMR or MapR

>Also, could you please elaborate on your comment ³The real trouble is
>that MapReduce cannot re-direct data at all (there¹s only shuffle
>edges)"? Thanks!

Mapreduce cannot do non pair-wise routing operations, since it violates
the direct assumption of map() partitioning (same applies to Spark's map()

Tez goes a bit out of the way to separate the control plane from the data
plane, so that you can do non-pairwise operations like auto-reducer
parallelism or splitting up pair-wise operations as m:n matching.

Here's slides from last year describing how that rewiring works.


View raw message