hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "Hive/JoinOptimization" by LiyinTang
Date Thu, 02 Dec 2010 02:19:31 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "Hive/JoinOptimization" page has been changed by LiyinTang.
http://wiki.apache.org/hadoop/Hive/JoinOptimization?action=diff&rev1=11&rev2=12

--------------------------------------------------

  '''Fig 1. The Previous Map Join Implementation'''
  
  In Fig 1 above, the previous map join implementation does not scale well when the larger
table is huge, because each Mapper reads the small table data directly from HDFS. If the
larger table is huge, thousands of Mappers are launched to read different records
of it, and each of those Mappers reads the small table data from HDFS
into its memory. Access to the small table can then become the performance bottleneck,
or the Mappers may hit many time-outs while reading the small file, which may cause
the task to fail.
- 
  
  HIVE-1641 ([[http://issues.apache.org/jira/browse/HIVE-1641|http://issues.apache.org/jira/browse/HIVE-1641]])
has solved this problem, as shown in Fig 2 below.
  
@@ -51, +50 @@

  
  '''Fig 5: The Join Execution Flow'''
  
- As shown in Fig 5, the left side shows the previous Common Join execution flow, which is
very straightforward. The right side shows the new Common Join execution flow.
At compile time, the query processor generates a Conditional Task, which contains
a list of tasks, one of which is resolved to run at execution time.
In other words, the tasks in the Conditional Task's list are candidates, and one of them is
chosen at run time. First, the original Common Join Task is put into
the list. The query processor also generates a series of Map Join Tasks by assuming that each
of the input tables may be the big table. Take the same example as before, '''''select /*+mapjoin(a)*/
* from src1 x join src2 y on x.key=y.key'''''. Either table, '''''src1''''' or '''''src2''''', may
be the big table, so it generates 2 Map Join Tasks: one assuming src1 is the big table
and the other assuming src2 is the big table, as shown in Fig 6.
+ As shown in Fig 5, the left side shows the previous Common Join execution flow, which is
very straightforward. The right side shows the new Common Join execution flow.
At compile time, the query processor generates a Conditional Task, which contains
a list of tasks, one of which is resolved to run at execution time.
In other words, the tasks in the Conditional Task's list are candidates, and one of them is
chosen at run time. First, the original Common Join Task is put into
the list. The query processor also generates a series of Map Join Tasks by assuming that each
of the input tables may be the big table. For example, '''''select * from src1 x join src2 y
on x.key=y.key'''''. Either table, '''''src1''''' or '''''src2''''', may be the big table, so
it generates 2 Map Join Tasks: one assuming src1 is the big table and the other assuming
src2 is the big table, as shown in Fig 6.
  
  {{attachment:fig6.jpg||height="778px",width="1072px"}}
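
The candidate-list idea described above can be illustrated with a small Python sketch. This is not Hive's implementation: the names `Task`, `build_candidates` and `resolve`, and in particular the size-based selection rule and the `small_table_limit` parameter, are assumptions made for illustration only. The page states that one candidate is chosen at run time but this excerpt does not say how.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Task:
    kind: str            # "common_join" or "map_join"
    big_table: str = ""  # only meaningful for map joins


def build_candidates(tables: List[str]) -> List[Task]:
    """Compile time: the Common Join first, then one Map Join per input
    table, each assuming that table is the big one."""
    candidates = [Task("common_join")]
    candidates += [Task("map_join", t) for t in tables]
    return candidates


def resolve(candidates: List[Task],
            table_sizes: Dict[str, int],
            small_table_limit: int) -> Task:
    """Run time: pick exactly one candidate.

    Hypothetical rule (an assumption, not taken from the page): choose a
    Map Join whose remaining tables together fit under small_table_limit,
    preferring the one with the largest big table; otherwise fall back
    to the Common Join, which is the first candidate.
    """
    best = None
    for task in candidates:
        if task.kind != "map_join":
            continue
        small_total = sum(size for name, size in table_sizes.items()
                          if name != task.big_table)
        if small_total <= small_table_limit:
            if best is None or table_sizes[task.big_table] > table_sizes[best.big_table]:
                best = task
    return best if best is not None else candidates[0]
```

For the two-table query above, `build_candidates(["src1", "src2"])` yields three candidates: the Common Join, a Map Join with src1 as the big table, and a Map Join with src2 as the big table.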
  
