hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Steve Howard <>
Subject mapjoin with left join
Date Fri, 11 Sep 2015 16:48:49 GMT
We would like to utilize mapjoin for the following SQL construct:

select small.* from small s left join large l on = where is

We can easily fit small into RAM, but large is over 1TB according to
optimizer stats. Unless we set = to at least the size of
"large", the optimizer falls back to a common map join, which is incredibly

Given the fact it is a left join, which means we won't always have rows in
large for each row in small, is this behavior expected? Could it be that
reading the large table would miss the new rows in small, so the large one
has to be the one that is probed for matches?

We simply want to load the 81K rows in to RAM, then for each row in large,
check the small hash table and if it the row in small is not in large, then
add it to large.

Again, the optimizer will use a mapjoin if we set = 1TB (the size of the large
table). This is of course, not practical. The small table is only 50MB.

At the link below is the entire test case with two tables, one of which has
three rows and other has 96. We can duplicate it with tables this small,
which leads me to believe I am missing something, or this is a bug.

The link has the source code that shows each table create, as well as the
explain with an argument for
that is passed at the command line. The output shows a mergejoin when the size is less than 192 (the
size of the larger table), and a mapjoin when is larger than 192 (large
table fits).

The business case is loading only new rows into a large fact table.  The
new rows are the ones that are small in number.

View raw message