hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Peter Chu <pete....@outlook.com>
Subject RE: Hive QL - NOT IN, NOT EXIST
Date Mon, 06 May 2013 17:56:32 GMT
Thanks Stephen,
Will start a cluster today to see if it helps.
Peter

Date: Mon, 6 May 2013 00:05:45 -0700
Subject: Re: Hive QL - NOT IN, NOT EXIST
From: javadba@gmail.com
To: user@hive.apache.org

Hi Peter,   Looks like mapjoin does not work with outer join so streamtable is instead a possible
approach. You would stream the larger table through the smaller one: 

 can you see whether the following helps your perf issue?
select /*+ streamtable(message) */ f.uuid  from message m right outer join feed f on m.uuid
= f.uuid where m.uuid = null;





2013/5/5 Peter Chu <pete.chu@outlook.com>




Thanks, Stephen,
I do not quite understand what you mean by Stream, specifically "Stream the 400M message records
through the in-memory maps".Can you please elaborate.

Also, can you use MAPJOIN on left outer join?
Peter

Date: Sun, 5 May 2013 21:44:37 -0700
Subject: Re: Hive QL - NOT IN, NOT EXIST

From: javadba@gmail.com
To: user@hive.apache.org



@Peter  Does the query plan demonstrate that the 3Meg row table is being map-joined and the
400M table streamed through? That is what you want: but you might either need to fiddle with
hints to get it to happen

Details:    Read uuids s of feed into  in-memory map on all nodes (mapjoin)     Stream the
400M message records through the in-memory maps, copying id's from the "all feed uuids"  map
to a  "matched feed uuid's map for entries that have matches in the messages 


     Note: this way the 400M rows are only read once on the cluster. 
You can see whether hive can manage this or if you write a custom m/r job to do it.
 

2013/5/5 Peter Chu <pete.chu@outlook.com>





It works but it takes a very long time because the subqueries in NOT IN contains 400 million
rows (the message table in the example) and the feed table contains 3 million rows.


SELECT uuid from feed f WHERE f.uuid NOT IN (SELECT uuid FROM message);


> Date: Sun, 5 May 2013 20:25:15 -0700
> From: michaelmalak@yahoo.com
> Subject: Re: Hive QL - NOT IN, NOT EXIST
> To: user@hive.apache.org


> 
> 
> --- On Sun, 5/5/13, Peter Chu <pete.chu@outlook.com> wrote:
> 
> > I am wondering if there is any way to do this without resorting to


> > using left outer join and finding nulls.
> 
> I have found this to be an acceptable substitute.  Is it not working for you?
> 
 		 	   		  

 		 	   		  

 		 	   		  
Mime
View raw message