hadoop-pig-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Alan Gates <ga...@yahoo-inc.com>
Subject Re: does EvalFunc generate the entire bag always ?
Date Thu, 27 May 2010 21:32:53 GMT
The default case is that a UDFs that take bags (such as COUNT, etc.)  
are handed the entire bag at once.  In the case where all UDFs in a  
foreach implement the algebraic interface and the expression itself is  
algebraic than the combiner will be used, thus significantly limiting  
the size of the bag handed to the UDF.  The accumulator does hand  
records to the UDF a few thousand at a time.  Currently it has no way  
to turn off the flow of records.

What you want might be accomplished by the LIMIT operator, which can  
be used inside a nested foreach.  Something like:

C = foreach B {
	C1 = sort A by $0;
	C2 = limit 5 C1;
	generate myUDF(C2);


On May 26, 2010, at 11:59 AM, hc busy wrote:

> Hey, guys, how are Bags passed to EvalFunc stored?
> I was looking at the Accumulator interface and it says that the  
> reason why
> this needed for COUNT and SUM is because EvalFunc always gives you the
> entire bag when the EvalFunc is run on a bag.
> I always thought if I did COUNT(TABLE) or SUM(TABLE.FIELD), and the  
> code
> inside that does
> for(Tuple entry:inputDataBag){
> .... stuff
> }
> was an actual iterator that iterated on the bag sequentially without
> necessarily having the entire bag in memory all at once. ?? Because  
> it's an
> iterator, so there's no way to do anything other than to stream  
> through it.
> I'm looking at this because Accumulator has no way of telling Pig  
> "I've seen
> enough" It streams through the entire bag no matter what happens.  
> (like,
> hypothetically speaking, if I was writing "5th item of a sorted bag"  
> udf),
> after I see 5th of a 5 million entry bag, I want to stop executing if
> possible.
> Is there a easy way to make this happen?

View raw message