hadoop-pig-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From hc busy <hc.b...@gmail.com>
Subject Re: does EvalFunc generate the entire bag always ?
Date Tue, 01 Jun 2010 20:44:31 GMT
well, see that's the thing, the 'sort A by $0' is already nlg(n)

ahh, I see, my own example suffers from this problem.

I guess I'm wondering how 'limit' works in conjunction with UDF's... A
practical application escapes me right now, But if I do

C = foreach B{
   C1 = MyUdf(B.bag_on_b);
   C2 = limit C1 5;

does it know to push limit in this case?

On Thu, May 27, 2010 at 2:32 PM, Alan Gates <gates@yahoo-inc.com> wrote:

> The default case is that a UDFs that take bags (such as COUNT, etc.) are
> handed the entire bag at once.  In the case where all UDFs in a foreach
> implement the algebraic interface and the expression itself is algebraic
> than the combiner will be used, thus significantly limiting the size of the
> bag handed to the UDF.  The accumulator does hand records to the UDF a few
> thousand at a time.  Currently it has no way to turn off the flow of
> records.
> What you want might be accomplished by the LIMIT operator, which can be
> used inside a nested foreach.  Something like:
> C = foreach B {
>        C1 = sort A by $0;
>        C2 = limit 5 C1;
>        generate myUDF(C2);
> }
> Alan.
> On May 26, 2010, at 11:59 AM, hc busy wrote:
>  Hey, guys, how are Bags passed to EvalFunc stored?
>> I was looking at the Accumulator interface and it says that the reason why
>> this needed for COUNT and SUM is because EvalFunc always gives you the
>> entire bag when the EvalFunc is run on a bag.
>> I always thought if I did COUNT(TABLE) or SUM(TABLE.FIELD), and the code
>> inside that does
>> for(Tuple entry:inputDataBag){
>> .... stuff
>> }
>> was an actual iterator that iterated on the bag sequentially without
>> necessarily having the entire bag in memory all at once. ?? Because it's
>> an
>> iterator, so there's no way to do anything other than to stream through
>> it.
>> I'm looking at this because Accumulator has no way of telling Pig "I've
>> seen
>> enough" It streams through the entire bag no matter what happens. (like,
>> hypothetically speaking, if I was writing "5th item of a sorted bag" udf),
>> after I see 5th of a 5 million entry bag, I want to stop executing if
>> possible.
>> Is there a easy way to make this happen?

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message