pig-dev mailing list archives

From "Goel, Ankur" <ankur.g...@corp.aol.com>
Subject Top-K for nested fields
Date Thu, 08 Jan 2009 09:32:36 GMT
Hi Folks,

I have a case where I need to do top-K on nested fields
in my tuples. For example, consider the following tuples (format is [url,

(abc.com, A)
(abc.com, A)
(abc.com, C)
(abc.com, B)
(xyz.com, D)
(xyz.com, D)
(xyz.com, E)


I need to group by URL and output the top-K queries, along with their
counts, for each URL. So the output would be:

abc.com A 2
abc.com B 1
abc.com C 1
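
The intended semantics can be sketched in plain Python (not Pig) as follows. The function name top_k_per_url and the alphabetical tie-break between queries with equal counts are assumptions made purely for illustration. This is only a reference for the expected result, not a Pig solution.

```python
from collections import Counter, defaultdict


def top_k_per_url(pairs, k):
    """Return {url: [(query, count), ...]} with at most k entries per URL."""
    # Count how often each query occurs under each URL.
    counts = defaultdict(Counter)
    for url, query in pairs:
        counts[url][query] += 1

    result = {}
    for url, counter in counts.items():
        # Highest count first; ties broken alphabetically by query (assumption).
        ranked = sorted(counter.items(), key=lambda item: (-item[1], item[0]))
        result[url] = ranked[:k]
    return result


# The sample tuples from this message.
pairs = [
    ("abc.com", "A"), ("abc.com", "A"), ("abc.com", "C"), ("abc.com", "B"),
    ("xyz.com", "D"), ("xyz.com", "D"), ("xyz.com", "E"),
]

for url, ranked in top_k_per_url(pairs, 10).items():
    for query, count in ranked:
        print(url, query, count)
```

For the sample tuples this prints the three abc.com rows shown above, followed by "xyz.com D 2" and "xyz.com E 1".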



In my understanding we would do something like


url = GROUP tuples BY url;
result = FOREACH url GENERATE group, top(10, query);


Is there a UDF to do this? If not then I can write one and possibly


Is there any other way of doing it?



