pig-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Xiaomeng Wan <shawn...@gmail.com>
Subject Re: Filter grouped data with two percentile
Date Wed, 14 Sep 2011 16:26:44 GMT
Pierre,

Union is not allowed within foreach. Fortunately, you donot need it. I
just realize the code I give you doesnot generate what you want,
actually it generates the complement of what you want. Try something
like this:

a = group records by id;
b = foreach a {




On Wed, Sep 14, 2011 at 10:09 AM, Pierre-Luc Brunet
<pierre-ml@zestuff.com> wrote:
> Shawn,
>
> This looks indeed pretty good except for one thing.
>
> I already do a GROUP on my table in order to group my records by "item". If I run your
code, I end up filtering against the entire data set instead of filtering each group individually.
 I tried to put your code inside a foreach statement without much luck.
>
> Any idea?
>
> --
> Pierre
>
>
> On 2011-09-13, at 5:49 PM, Xiaomeng Wan wrote:
>
>> try this:
>>
>> a = group allrecords all;
>> b = foreach a generate COUNT(allrecords) as total; //or COUNT_STAR
>>
>> c = order allrecords by thevalue;
>> d = limit c b.total*0.05;
>>
>> e = order allrecords by thevalue desc;
>> f = limit e b.total*0.05;
>>
>> g = union d, f;
>>
>> Shawn
>>
>> On Tue, Sep 13, 2011 at 2:39 PM, Pierre-Luc Brunet <pierre@zestuff.com> wrote:
>>> Question.
>>>
>>> What would be the best way in Pig to grab a set of data, find the record that
matches the 5th percentile, find the record that matches the 95th percentile and throw away
what's before and after that?
>>>
>>> Obviously, my math doesn't work for this.
>>>
>>> In hope that it helps clarifying what I'm trying to do, here's how I currently
do it in Javascript:
>>>
>>> result.bid_array.sort(function(a,b) { return a - b; });
>>> var bid_p5 = Math.round(5/100 * result.bid_array.length);
>>> var bid_p95 = Math.round(95/100 * result.bid_array.length);
>>>
>>> result.bid_array.splice(bid_p95, result.bid_array.length - ( result.bid_array.length
- bid_p95));
>>> result.bid_array.splice(0, bid_p5);
>>>
>>> --
>>> Pierre-Luc Brunet
>>> ZeStuff - http://www.zestuff.com
>>>
>>> Phone: (877) 5ZESTUFF
>>> Mobile: (514) 600-0234
>>> Email: pierre@zestuff.com
>>>
>>> 9320 Saint-Laurent, #502
>>> Montreal, QC, Canada, H2N 1N7
>>>
>>> On 2011-09-08, at 10:49 PM, Dmitriy Ryaboy wrote:
>>>
>>>> If you look at the data for #25 you posted below, you will find that there
>>>> is no row such that the price is between 5 and 95%!
>>>> khadgar is such an extreme outlier, it moves the 5% line above everyone
>>>> else, and of course it itself sets the 100% line.
>>>>
>>>> D
>>>>
>>>> On Thu, Sep 8, 2011 at 7:03 PM, Pierre-Luc Brunet <pierre@zestuff.com>wrote:
>>>>
>>>>> That worked except that for some reason, there's a lot of data that is
>>>>> missing in the final output (compared to what it should return).
>>>>>
>>>>> For example, the file I load has these lines:
>>>>>
>>>>> 7       25      us      darkspear       a       Redacted
       4750
>>>>> 5000    1
>>>>> 8       25      us      emerald-dream   a       Lornadoome
     9500
>>>>> 10000   1
>>>>> 21      25      eu      khadgar a       Haiibanklol    
769499  809999  1
>>>>> 7       25      us      queldorei       a       Worfgt
 27862   34827   1
>>>>> 3       25      us      antonidas       a       Oldcrafter
     19000
>>>>> 20000   1
>>>>>
>>>>> However, when I load up the script http://pastebin.com/Bk8RBAHt (now
>>>>> grouped on only one column), I don't have any records with 25 as the
key.
>>>>> The first 5 rows in my tsv files are
>>>>>
>>>>> 35      3.19973415E7
>>>>> 36      122914.0
>>>>> 37      50000.0
>>>>> 38      416099.9
>>>>> 39      901333.8571428572
>>>>> 43      191496.5
>>>>> 44      236454.0
>>>>>
>>>>>
>>>>> I really have no idea where the missing rows went :\
>>>>>
>>>>> --
>>>>> Pierre-Luc Brunet
>>>>> ZeStuff - http://www.zestuff.com
>>>>>
>>>>> Phone: (877) 5ZESTUFF
>>>>> Mobile: (514) 600-0234
>>>>> Email: pierre@zestuff.com
>>>>>
>>>>> 9320 Saint-Laurent, #502
>>>>> Montreal, QC, Canada, H2N 1N7
>>>>>
>>>>> On 2011-09-08, at 8:45 PM, Xiaomeng Wan wrote:
>>>>>
>>>>>> you can change
>>>>>>
>>>>>> GENERATE group, auctionsPrice.price AS price:tuple, p5 AS p5, p95
AS p95;
>>>>>> to
>>>>>> GENERATE FLATTEN(group) as (item, region, realm, faction),
>>>>>> FLATTEN(auctionsPrice.price) AS price, p5 AS p5, p95 AS p95;
>>>>>>
>>>>>> then regroup after the foreach block
>>>>>>
>>>>>> p2 = FILTER p1 BY (price >= p5 AND price <= p95);
>>>>>> p2a = group p2 by (item, region, realm, faction);
>>>>>> p3 = FOREACH p2a GENERATE group, AVG(p2.price) AS price;
>>>>>>
>>>>>> or write you own UDF to get the average within the foreach block.
It
>>>>>> would be ideal if we can move p2 statement into the foreach block
like
>>>>>> this: p2 = filter autionsPrice by price >= p5 and price <=
p95, but i
>>>>>> donot think it is supported right now.
>>>>>>
>>>>>> Shawn
>>>>>>
>>>>>>
>>>>>> On Thu, Sep 8, 2011 at 5:54 PM, Pierre-Luc Brunet <pierre@zestuff.com>
>>>>> wrote:
>>>>>>> Heya!
>>>>>>>
>>>>>>> I've been trying to do something with Pig for about 4 days now
and I
>>>>> have nothing but failure to show for it. I was wondering if anybody could
>>>>> look at my queries and slap some sense into me? I've uploaded the queries
to
>>>>> pastebin: http://pastebin.com/kzMxYwrY
>>>>>>>
>>>>>>> In short, I want to take my data, group it by 4 fields, then
for each
>>>>> group, I want to:
>>>>>>> - Find out the 5th and the 95th percentile for the 'price'
>>>>>>> - Filter each group to remove the records that are < 5th percentile
and
>>>>>> 95 percentile.
>>>>>>>
>>>>>>> Then for each group, I want to grab the AVG() of what's left.
>>>>>>>
>>>>>>> I tried many variations of the same code and always ended up
with either
>>>>> "incompatible types in GreaterThanEqual Operator" or "Scalar has more
than
>>>>> one row in the output."
>>>>>>>
>>>>>>> Any help would be greatly appreciated. Thanks! :)
>>>>>>> --
>>>>>>> Pierre-Luc Brunet
>>>>>>>
>>>>>
>>>>>
>>>>>
>>>
>>>
>>>
>
>
>

Mime
View raw message