hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Charles Menguy <menguy.char...@gmail.com>
Subject MAPJOIN hint giving incomplete results with union and/or groupby in query
Date Thu, 26 Jul 2012 22:41:09 GMT
*Hi guys,*
*
*
*I was having trouble lately getting a certain Hive query to work fine, and
I would like some advice.*
*
*
*My query is mainly joining a large table with several small tables. This
takes a while using regular joins, so it felt like a good use case to use
optimizer hints to cache the small tables in memory (they are definitely
small enough to fit in memory).*
*
*
*When I've tried running it with  /*+ MAPJOIN(a, b, c) */, it works fine
and much faster as expected.*
*But what I don't understand is that, simply by adding or removing this
optimizer hint, the results vary a lot and. While I am getting the correct
results without the mapjoin, the results with the mapjoin are incomplete.*
*
*
*To give more context, my large table is actually an union between 2
sub-queries, each identified by an id, and then I group everything together.
*
*What I get in the end result is all of the records from one side of the
union, but none from the other side of the union.*
*And the weird thing is that it seems to vary, I can get all the records
from the left side of the union, but i've also gotten all the records from
the right side of the union sometimes. Either side that I get in the end is
complete and correct, but I get nothing for the other half.*
*
*
*I'd like to know if there is anything I'm not aware of that could give
different results when using a mapjoin to cache the small tables? I'm not
using any extra Hive flags, just a simple mapjoin to cache the small tables.
*
*Could this be a Hive bug? I saw some Jira tickets about errors with
mapjoins and union/groupby, but I'm not getting any exception here as
opposed to what I could find.*
*I'm using cdh3u2 hive version 0.7.1*
*
*
*Here's the outline of my query if that helps: *
*
*
*select /*+ MAPJOIN(smalla, smallb, smallc) */*
*  **
*from*
*  (select * from*
*    tmp_table tbl*
*    join (*
*      select *** **from*
*        log_table l*
*        lateral view explode(l.list1) tmp as data*
*    ) lat on (...)*
*  UNION ALL*
*  select * from*
*    tmp_table tbl*
*    join (*
*      select * from*
*        log_table l*
*        lateral view explode(l.list2) tmp as data*
*    ) lat on (...)*
*  ) big*
*  JOIN small_table smalla on (...)*
*  JOIN small_table smallb on (...)*
*  JOIN small_table smallc on (...)*
*GROUP BY*
*  ...*
*
*
*Note that it seems to give complete results though if I put the UNION
query for the big table into a temporary table, and then use this temporary
table directly within the final query. But whenever the union is inside the
final query with the hint, half of it gets ignored. I'd rather do it in a
single query than creating a temporary table, so I wanted to make sure
there is no other way, otherwise I'll probably stick with that approach.*
*
*
*Thanks !*

Mime
View raw message