hadoop-pig-dev mailing list archives

From "Ankur (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-821) simulate NTILE(n) , rank() functionality in pig
Date Tue, 09 Jun 2009 10:25:07 GMT

    [ https://issues.apache.org/jira/browse/PIG-821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12717635#action_12717635 ]

Ankur commented on PIG-821:

OK, so I tried writing an NTILE UDF that accepts:
1. The number of tiles
2. A bag of sorted tuples

The problem is that this is essentially a serial process, not the parallel one that one would
expect. So I am not sure whether an NTILE operation can be done efficiently via a UDF. An efficient
NTILE operation over a sorted dataset should:
1. Partition the sorted data into the number of tiles requested.
2. Preserve the ordering in each tile.
3. Have each tile contain exactly the number of elements required by the NTILE logic.
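
To make the three requirements concrete, here is a minimal Python sketch of the serial NTILE logic (not the Pig UDF itself, and not Pig's API). The function name `ntile` and the in-memory list are assumptions for illustration; the input rows are the six employees from the Oracle example quoted in the issue description. Note that the tile sizes depend on the total row count, which is why a single pass over the whole sorted bag is needed.

```python
def ntile(sorted_rows, n):
    """Assign 1-based tile numbers to rows that are already sorted.

    Hypothetical helper for illustration: the first (len % n) tiles
    receive one extra row, as in SQL's NTILE(n).
    """
    if n <= 0:
        raise ValueError("n must be positive")
    base, extra = divmod(len(sorted_rows), n)
    out = []
    idx = 0
    for tile in range(1, n + 1):
        # Tiles 1..extra hold base + 1 rows; the rest hold base rows.
        size = base + (1 if tile <= extra else 0)
        for row in sorted_rows[idx:idx + size]:
            out.append((row, tile))
        idx += size
    return out


# Rows from the quoted Oracle example, already ordered by salary descending.
employees = [
    ("Greenberg", 12000),
    ("Faviet", 9000),
    ("Chen", 8200),
    ("Urman", 7800),
    ("Sciarra", 7700),
    ("Popp", 6900),
]
quartiles = [tile for _, tile in ntile(employees, 4)]
print(quartiles)  # [1, 1, 2, 2, 3, 4]
```

The result matches the QUARTILE column in the quoted example, and the input order is preserved within each tile.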

There is a total-order partitioner in Hadoop (http://issues.apache.org/jira/browse/HADOOP-3019)
that produces totally ordered output data. However, it cannot strictly enforce the number
of elements in each output part, which is a necessary condition for complying with NTILE.
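
A small Python sketch of why this matters, assuming the partitioner routes each record by comparing its key against a fixed list of split points (the `range_partition` helper and the split points below are hypothetical, not Hadoop's actual code or API):

```python
from bisect import bisect_right


def range_partition(sorted_values, split_points):
    """Route each value to a partition by comparing it with split points.

    Hypothetical stand-in for a range partitioner: partition i receives
    the values v with split_points[i-1] <= v < split_points[i].
    """
    parts = [[] for _ in range(len(split_points) + 1)]
    for v in sorted_values:
        parts[bisect_right(split_points, v)].append(v)
    return parts


values = list(range(1, 13))   # 12 sorted keys
split_points = [2, 3, 11]     # assumed split points, e.g. from a skewed sample
parts = range_partition(values, split_points)
sizes = [len(p) for p in parts]
print(sizes)  # [1, 1, 8, 2]
```

The concatenated partitions are still globally sorted, but the sizes are 1, 1, 8 and 2, whereas NTILE(4) over 12 rows requires exactly 3 rows per tile.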

Any thoughts?

> simulate NTILE(n) , rank() functionality in pig
> -----------------------------------------------
>                 Key: PIG-821
>                 URL: https://issues.apache.org/jira/browse/PIG-821
>             Project: Pig
>          Issue Type: New Feature
>          Components: impl
>    Affects Versions: 0.2.0
>         Environment: mithril gold -gateway 4000
>            Reporter: Rekha
>             Fix For: 0.2.0
> Hi,
> I came across a job with some processing that I can't seem to get easily out of the box from Pig.
> These are the NTILE()/rank() operations available in Oracle.
> While I am trying to write a UDF, that is not working out too well for me yet. :(
> I have an ntile(n) over (partition by x, y, z order by a desc, b desc) operation to be done in Pig scripts.
> Is there a default function in Pig scripting which can do this?
> For example, let's consider a simple example at http://download.oracle.com/docs/cd/B14117_01/server.101/b10759/functions091.htm
> So here, what would we ideally substitute NTILE() with? Is there any Pig counterpart function/UDF?
> SELECT last_name, salary, NTILE(4) OVER (ORDER BY salary DESC) 
>    AS quartile FROM employees
>    WHERE department_id = 100;
> LAST_NAME                     SALARY   QUARTILE
> ------------------------- ---------- ----------
> Greenberg                      12000          1
> Faviet                          9000          1
> Chen                            8200          2
> Urman                           7800          2
> Sciarra                         7700          3
> Popp                            6900          4
> In the real case, I have ntile over multiple columns, so an ideal way to find the histograms/boundaries and emit the bucket number is needed.
> Similarly, a Pig function is required for rank() over (partition by a, b, c order by d desc) as e
> Please let me know soon.
> Thanks & Regards,
> /Rekha

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
