hadoop-pig-dev mailing list archives

From "Olga Natkovich" <ol...@yahoo-inc.com>
Subject RE: Comparison between Tuple compare & WritableComparable compare
Date Thu, 22 May 2008 22:08:12 GMT
 

> -----Original Message-----
> From: Alan Gates [mailto:gates@yahoo-inc.com] 
> Sent: Thursday, May 22, 2008 2:01 PM
> To: Shravan Narayanamurthy
> Cc: pig-dev@incubator.apache.org
> Subject: Re: Comparison between Tuple compare & 
> WritableComparable compare
> 
> Clearly we should be thinking about exec time.  And having to 
> load one less bag into memory should greatly reduce exec 
> time, at least in the case where we can't fit that bag into 
> memory and we have to spill.  I have no idea how to compare the two 
> and say which is the better performance gain.

In addition to performance, this can mean failing or succeeding on some
joins. If we can't bring the data for a key into memory, we can still
process the query if we just stream through the data.
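
As an aside, here is a schematic sketch of that idea. This is not Pig's
actual implementation, and the class and method names are made up
purely for illustration: the records for a key arrive as one sorted
stream, the first n-1 inputs are buffered as in-memory bags, and the
last input is joined as it streams past, so it never has to fit in
memory all at once.

// Schematic only; not Pig's code.
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class StreamedJoinSketch {

    // Stand-in for a Pig tuple: the first field tags which input it came from.
    static class TaggedRecord {
        final int inputIndex;
        final Object[] fields;
        TaggedRecord(int inputIndex, Object... fields) {
            this.inputIndex = inputIndex;
            this.fields = fields;
        }
    }

    // Called once per key with all records for that key, assuming the sort
    // delivers the records in order of their input tag.
    static void joinOneKey(Iterator<TaggedRecord> records, int numInputs) {
        List<List<TaggedRecord>> bags = new ArrayList<List<TaggedRecord>>();
        for (int i = 0; i < numInputs - 1; i++) {
            bags.add(new ArrayList<TaggedRecord>());
        }
        while (records.hasNext()) {
            TaggedRecord rec = records.next();
            if (rec.inputIndex < numInputs - 1) {
                bags.get(rec.inputIndex).add(rec);   // buffer the first n-1 inputs
            } else {
                emitJoined(bags, rec);               // stream the last input
            }
        }
    }

    // Combine the streamed record with the buffered bags and emit the
    // joined output (details omitted in this sketch).
    static void emitJoined(List<List<TaggedRecord>> bags, TaggedRecord streamed) {
    }
}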

> 
> A few thoughts:
> 
> 1) We're in the boat of using tuples anytime a user groups, 
> cogroups, or sorts on more than one column and for all 
> distincts, correct?  So we have this problem at least in some 
> cases, no matter what.
> 
> 2) In the previous code, we had switched from using the tuple 
> object comparator to using a binary comparator provided by 
> hadoop.  This gave us a large speed up.  Are we still using 
> that binary comparator?

There is no reason why we should not use the binary comparator!
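
For reference, this is the kind of binary comparator Hadoop supports.
The following is only a sketch (not Pig's actual comparator, and the
class name is made up): it orders IntWritable keys directly on their
serialized bytes, so the sort never deserializes the keys into objects.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.WritableComparator;

// Sketch of a raw (binary) comparator: IntWritable serializes as 4
// big-endian bytes, so we can order keys without creating any objects.
public class RawIntComparator extends WritableComparator {

    public RawIntComparator() {
        super(IntWritable.class);
    }

    @Override
    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        int left = readInt(b1, s1);
        int right = readInt(b2, s2);
        return (left < right) ? -1 : ((left == right) ? 0 : 1);
    }
}

A comparator like this is registered on the job (for example through
JobConf.setOutputKeyComparatorClass) so that the shuffle sorts on raw
bytes instead of deserialized keys.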

> 
> 3) We need to take a look at the tuple and see what is taking 
> so long.  
> Are we spending time constructing the tuples vs. Hadoop's 
> WritableComparable types, time comparing them, etc.?
> 
> Alan.
> 
> Shravan Narayanamurthy wrote:
> > I completely messed up the calculation of speed reduction. Sorry. 
> > The 30 to 40 times speed reduction in comparison time leads to the 
> > same reduction in speed even when we do n log n comparisons :)
> >  
> > Still, don't you think it's a high price to pay just to go from n 
> > to n-1 bags? I agree that the memory savings can be huge, but 
> > shouldn't we also be thinking about exec time?
> >  
> > Thanks,
> > --Shravan
> >
> > ________________________________
> >
> > From: Shravan Narayanamurthy
> > Sent: Thu 5/22/2008 11:35 PM
> > To: Alan Gates
> > Subject: Comparison between Tuple compare & WritableComparable 
> > compare
> >
> >
> >
> > Hi Alan,
> > I compared the time to compare two WritableComparables a million 
> > times with the time to compare the same objects when embedded in a 
> > Tuple. The Tuple has two elements: the first is the index and the 
> > second is the actual object:
> >
> > BOOLEAN : Tuple :: 14.16 : 602.76
> > BYTEARRAY : Tuple :: 53.94 : 414.06
> > CHARARRAY : Tuple :: 50.9 : 417.86
> > FLOAT : Tuple :: 20.2 : 655.4
> > INTEGER : Tuple :: 14.24 : 539.3
> > LONG : Tuple :: 16.08 : 578.6
> >
> >
> > The numbers surely look depressing. I was wondering if it's a good 
> > idea to do the (n-1) bag optimization at all, because just adding 
> > two inputs into the cogroup would make us send tuples as keys, 
> > incurring nearly a 30 to 40 times slowdown just for comparing. 
> > Since we are sorting, we will do n log n comparisons, thus 
> > incurring a 150 to 200 times reduction in speed. Joins being 
> > pretty commonly used, I feel we should avoid this optimization.
> >
> > Thanks,
> > --Shravan
> >
> 
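
For what it's worth, a rough microbenchmark along the lines of the
numbers above can be sketched as follows. TwoFieldTuple is just a
stand-in for illustration (it is not Pig's Tuple class); it holds an
index plus a value and compares field by field through Comparable,
which is where the extra dispatch and casting cost shows up.

import org.apache.hadoop.io.IntWritable;

public class CompareBenchmark {

    // Stand-in for a two-field tuple (index, value); not Pig's Tuple.
    static class TwoFieldTuple implements Comparable<TwoFieldTuple> {
        final Object[] fields = new Object[2];

        @SuppressWarnings("unchecked")
        public int compareTo(TwoFieldTuple other) {
            for (int i = 0; i < fields.length; i++) {
                int c = ((Comparable<Object>) fields[i]).compareTo(other.fields[i]);
                if (c != 0) {
                    return c;
                }
            }
            return 0;
        }
    }

    public static void main(String[] args) {
        final int n = 1000000;
        IntWritable a = new IntWritable(1);
        IntWritable b = new IntWritable(2);

        TwoFieldTuple ta = new TwoFieldTuple();
        TwoFieldTuple tb = new TwoFieldTuple();
        ta.fields[0] = new IntWritable(0); ta.fields[1] = a;  // index, value
        tb.fields[0] = new IntWritable(0); tb.fields[1] = b;

        // Accumulate the results so the loops are not optimized away.
        long sink = 0;

        long start = System.currentTimeMillis();
        for (int i = 0; i < n; i++) {
            sink += a.compareTo(b);
        }
        System.out.println("IntWritable:   " + (System.currentTimeMillis() - start) + " ms");

        start = System.currentTimeMillis();
        for (int i = 0; i < n; i++) {
            sink += ta.compareTo(tb);
        }
        System.out.println("Tuple wrapper: " + (System.currentTimeMillis() - start) + " ms");

        System.out.println("(checksum: " + sink + ")");
    }
}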
