hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Rory Sawyer <>
Subject Re: Cross join/cartesian product explanation
Date Mon, 09 Nov 2015 20:45:05 GMT
Hi Gopal,

Thanks for the speedy response! A follow-up question though: 10Mb input sounds like that would
work for a map join. I’m having trouble doing a cross join between two tables that are too
big for a map-side join. Trying to break down one table into small enough partitions and then
unioning them together seems to give comparable performance to a cross join. I’m running
Hive on Map Reduce right now. Short of moving to a different execution engine, are there any
performance improvements that can be made to lessen the pain of a cross join? Also, could
you please elaborate on your comment “The real trouble is that MapReduce cannot re-direct
data at all (there’s only shuffle edges)"? Thanks!


On 11/6/15, 5:09 PM, "Gopal Vijayaraghavan" < on behalf of>

>> Over the last few week I¹ve been trying to use cross joins/cartesian
>>products and was wondering why, exactly, this all gets sent to one
>>reducer. All I¹ve heard or read is that Hive can¹t/doesn¹t parallelize
>>the job. 
>The hashcode of the shuffle key is 0, since you need to process every row
>against every key - there's no possibility of dividing up the work.
>Tez will actually have a cross-product edge (TEZ-2104), which is a
>distributed cross-product proposal but wasn't picked up in the last Google
>Summer of Code.
>The real trouble is that MapReduce cannot re-direct data at all (there's
>only shuffle edges).
>> Does anyone have a workaround?
>I use a staged partitioned table as a workaround for this, hashed on a
>high nDV key - the goal of the Tez edge is to shuffle the data similarly
>at runtime.
>For instance, this python script makes a query with a 19x improvement in
>distribution for a cross-product which generates 50+Gb of data from a
>~10Mb input.
>It is possible for Hive-Tez to actually generate UNION VertexGroups, but
>it's much more efficient to do this as a edge with a custom EdgeManager,
>since that opens up potentially implementing ThetaJoins in hive using that.
View raw message