hive-user mailing list archives

From Koert Kuipers <>
Subject Re: multiple tables join with only one hug table.
Date Fri, 12 Aug 2011 17:17:16 GMT
A mapjoin does what you described: it builds hash tables for the smaller
tables. In recent versions of Hive (like the one I am using with Cloudera
CDH3u1) a mapjoin will be done for you automatically if you have your
parameters set correctly. The relevant parameters in hive-site.xml include
hive.mapjoin.maxsize and hive.mapjoin.smalltable.filesize. On the Hive
command line it will tell you that it is building the hashtable, and it
will not run a reducer.
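
As a rough illustration of the settings mentioned above, the same parameters
can also be set per session from the Hive command line instead of in
hive-site.xml. The values below are placeholders, not recommendations, and
hive.auto.convert.join is an assumption on my part about the switch that turns
automatic conversion on; it is not named in the message above, so check it
against the Hive version you run.

```sql
-- Assumption: this switch enables automatic conversion to a mapjoin.
SET hive.auto.convert.join=true;

-- Placeholder threshold: tables whose files are smaller than this many bytes
-- are treated as "small" and become candidates for the in-memory hash table.
SET hive.mapjoin.smalltable.filesize=25000000;

-- Placeholder limit associated with the mapjoin; tune for your data.
SET hive.mapjoin.maxsize=100000;
```

If the conversion takes effect, the job output should show the hash table
being built and no reduce phase for the join, as described above.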

On Thu, Aug 11, 2011 at 10:25 PM, Ayon Sinha <> wrote:

> The MAPJOIN hint helps optimize the join by loading the smaller tables
> specified in the hint into memory, so that every small table is held in
> the memory of each mapper.
> -Ayon
> ------------------------------
> *From:* "Daniel,Wu" <>
> *To:* hive <>
> *Sent:* Thursday, August 11, 2011 7:01 PM
> *Subject:* multiple tables join with only one hug table.
> If the retailer fact table is sale_fact with 10B rows, and it is joined with
> 3 small tables: stores (10K), products (10K), period (1K), what's the best
> join solution?
> In Oracle, it can first build a hash for stores, a hash for products, and a
> hash for period. Then it probes using the fact table: if a row matches in
> stores, that row goes on to be matched against products by a hash lookup; if
> it passes, it goes on to be matched against period. In this way, sale_fact
> only needs to be scanned once, which saves lots of disk I/O. Is this doable
> in Hive? If so, what hint should be used?
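
For the tables in the question, an explicit MAPJOIN hint might look like the
sketch below. Only the table names come from the thread; every column name
(store_id, product_id, period_id, amount and so on) is hypothetical. Whether
Hive handles all three joins in a single pass over sale_fact depends on the
Hive version and its settings, so treat this as a starting point rather than a
guaranteed plan.

```sql
-- Hypothetical schema: only the table names appear in the original question.
-- The hint names the aliases of the small tables to load into memory.
SELECT /*+ MAPJOIN(s, p, d) */
       f.amount,
       s.store_name,
       p.product_name,
       d.period_name
FROM sale_fact f
JOIN stores   s ON (f.store_id   = s.store_id)
JOIN products p ON (f.product_id = p.product_id)
JOIN period   d ON (f.period_id  = d.period_id);
```

Running EXPLAIN on the query is one way to check how many stages Hive plans
and whether each join is executed on the map side.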
