hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Daniel,Wu" <hadoop...@163.com>
Subject Re:Re: multiple tables join with only one hug table.
Date Sat, 13 Aug 2011 10:58:27 GMT
Thanks, it works, but not as effective as possible:

suppose we join 10 small tables (s1,s2...s10) with one huge table (big) in a database house
system (the join is between big table and small table, like star schema), after I set the
parameters as you set, it will have 10 mapside join, after one mapside join competes, it will
write huge data to file system (as one table is huge), then the next mapside join need to
read the hug data written to do another mapside join, so totally we need to read the huge
data 11 times and write it 10 times´╝łas the last write only return small data volume).  
The best execution plan I can think of is: first build 10 hash table: one for each small table,
and loop each row in the big table, if the row survive, just output, if not then discard,
in this way we only need to read the big data once, instead of read big data, write big data,
read big data, ...

flow is:
1: build 10 hash table
2: foreach row in big table
         probe the row with each of these 10 hash table
         if match all these 10 hash table, go to next step (output, etc)
         else discard the row.
    end loop


At 2011-08-13 01:17:16,"Koert Kuipers" <koert@tresata.com> wrote:
A mapjoin does what you described: it builds hash tables for the smaller tables. In recent
versions of hive (like the one i am using with cloudera cdh3u1) a mapjoin will be done for
you automatically if you have your parameters set correctly. The relevant parameters in hive-site.xml
are: hive.auto.convert.join, hive.mapjoin.maxsize and hive.mapjoin.smalltable.filesize. On
the hive command line it will tell you that it is building the hashtable, and it will not
run a reducer.


On Thu, Aug 11, 2011 at 10:25 PM, Ayon Sinha<ayonsinha@yahoo.com> wrote:

The Mapjoin hint syntax help optimize by loading the smaller tables specified in the Mapjoin
hint into memory. Then every small table is in memory of each mapper.
 
-Ayon
See My Photos on Flickr
Also check out my Blog for answers to commonly asked questions.


From: "Daniel,Wu" <hadoop_wu@163.com>
To: hive <user@hive.apache.org>
Sent: Thursday, August 11, 2011 7:01 PM
Subject: multiple tables join with only one hug table.



if the retailer fact table is sale_fact with 10B rows, and join with 3 small tables: stores
(10K), products(10K), period (1K). What's the best join solution?

In oracle, it can first build hash for stores, and hash for products, and hash for stores.
Then probe using the fact table, if the row matched in stores, that row can go up further
to map with products by hashing check, if pass, then go up further to try to match period.
In this way, the sale_fact only needs to be scanned once which save lots of disk IO.  Is this
doable in hive, if doable, what hint need to use?








Mime
View raw message