hadoop-pig-dev mailing list archives

From Alan Gates <ga...@yahoo-inc.com>
Subject Re: Comparison between Tuple compare & WritableComparable compare
Date Thu, 22 May 2008 21:01:23 GMT
Clearly we should be thinking about exec time.  And having to load one
less bag into memory should greatly reduce exec time, at least in the
case where we can't fit that bag into memory and have to spill.  I
have no idea how to compare the two and say which is the bigger
performance gain.

A few thoughts:

1) We're already stuck with tuples any time a user groups, cogroups, or
sorts on more than one column, and for all distincts, correct?  So we
have this problem at least in some cases, no matter what.

2) In the previous code, we had switched from using the tuple object
comparator to using a binary comparator provided by Hadoop.  This gave
us a large speedup.  Are we still using that binary comparator?

3) We need to take a look at the tuple and see what is taking so long.
Are we spending the time constructing the tuples (vs. Hadoop's
WritableComparable types), comparing them, or somewhere else?
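
To illustrate the distinction behind point 2, here is a minimal sketch of
comparing keys as objects versus comparing their serialized bytes directly.
It is not Hadoop's or Pig's code: the class and method names are
hypothetical, it handles only 4-byte big-endian ints, and it only mirrors
the idea behind a raw byte comparator (no deserialization, no object
allocation per comparison).

```java
import java.nio.ByteBuffer;

// Hypothetical illustration only; not the Hadoop or Pig implementation.
class RawCompareSketch {
    // Serialize an int as 4 big-endian bytes (assumed wire format).
    static byte[] serialize(int value) {
        return ByteBuffer.allocate(4).putInt(value).array();
    }

    // Object path: rebuild both keys as objects, then compare them.
    static int compareByDeserializing(byte[] a, byte[] b) {
        Integer left = ByteBuffer.wrap(a).getInt();
        Integer right = ByteBuffer.wrap(b).getInt();
        return left.compareTo(right);
    }

    // Binary path: compare the serialized bytes without building objects.
    static int compareRaw(byte[] a, byte[] b) {
        // The first byte carries the sign bit, so compare it as signed.
        if (a[0] != b[0]) {
            return Byte.compare(a[0], b[0]);
        }
        // Remaining bytes compare as unsigned, most significant first.
        for (int i = 1; i < 4; i++) {
            int x = a[i] & 0xff;
            int y = b[i] & 0xff;
            if (x != y) {
                return Integer.compare(x, y);
            }
        }
        return 0;
    }

    public static void main(String[] args) {
        byte[] a = serialize(-5);
        byte[] b = serialize(3);
        System.out.println("object compare sign: "
                + Integer.signum(compareByDeserializing(a, b)));
        System.out.println("raw compare sign:    "
                + Integer.signum(compareRaw(a, b)));
    }
}
```

Both paths give the same ordering; the difference is only in how much work
each comparison does, which is what a profile of the tuple path would need
to separate out.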


Shravan Narayanamurthy wrote:
> I completely messed up the calculation of the speed reduction. Sorry. The 30 to 40 times
> slowdown in comparison time leads to the same 30 to 40 times slowdown overall, even when
> we do n log n comparisons :)
> Still, don't you think it's a high price to pay just to go from n to n-1 bags? I agree
> that the memory savings can be huge, but shouldn't we also be thinking about exec time?
> Thanks,
> --Shravan
> ________________________________
> From: Shravan Narayanamurthy
> Sent: Thu 5/22/2008 11:35 PM
> To: Alan Gates
> Subject: Comparison between Tuple compare & WritableComparable compare
> Hi Alan,
> I compared the time to compare two WritableComparables a million times
> with the time to compare the same objects when embedded in a Tuple. The
> Tuple has two elements: the first is the index and the second is the
> actual object:
> BOOLEAN : Tuple :: 14.16 : 602.76
> BYTEARRAY : Tuple :: 53.94 : 414.06
> CHARARRAY : Tuple :: 50.9 : 417.86
> FLOAT : Tuple :: 20.2 : 655.4
> INTEGER : Tuple :: 14.24 : 539.3
> LONG : Tuple :: 16.08 : 578.6
> The numbers surely look depressing. I was wondering if it's a good idea
> to do the (n-1) bag optimization at all. Just adding two inputs to the
> cogroup would make us send tuples as keys, incurring a nearly 30 to 40
> times slowdown just for comparing. Since we are sorting, we will do
> n log n comparisons, thus incurring a 150 to 200 times reduction in
> speed. Joins being pretty commonly used, I feel we should avoid this
> optimization.
> Thanks,
> --Shravan
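
The ratios implied by the figures quoted above can be checked with a short
sketch. The timings are copied from the message (their units are not
stated there); the class name, method names, and the n * log2(n) cost
model are assumptions made for illustration. On these numbers the
fixed-width types come out roughly 32 to 43 times slower inside a Tuple,
while BYTEARRAY and CHARARRAY come out about 8 times slower. Because the
per-comparison cost multiplies the same n log n comparison count on both
sides, the modeled sort slowdown equals the per-comparison slowdown.

```java
// Hypothetical illustration; the timings are the ones quoted in the message.
class CompareRatioSketch {
    static final String[] TYPES =
            {"BOOLEAN", "BYTEARRAY", "CHARARRAY", "FLOAT", "INTEGER", "LONG"};

    // Each row: {plain WritableComparable time, time when wrapped in a Tuple}.
    static final double[][] TIMES = {
            {14.16, 602.76}, {53.94, 414.06}, {50.9, 417.86},
            {20.2, 655.4}, {14.24, 539.3}, {16.08, 578.6}
    };

    // How many times slower the Tuple comparison is.
    static double slowdown(double plain, double tuple) {
        return tuple / plain;
    }

    // Assumed cost model: (cost per comparison) * n * log2(n).
    static double sortCost(double perCompare, long n) {
        return perCompare * n * (Math.log(n) / Math.log(2));
    }

    public static void main(String[] args) {
        long n = 1_000_000L;
        for (int i = 0; i < TYPES.length; i++) {
            double perCompare = slowdown(TIMES[i][0], TIMES[i][1]);
            double sortRatio = sortCost(TIMES[i][1], n) / sortCost(TIMES[i][0], n);
            System.out.printf("%-9s compare: %.1fx  modeled sort: %.1fx%n",
                    TYPES[i], perCompare, sortRatio);
        }
    }
}
```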
