pig-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "David Ciemiewicz (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-821) simulate NTILE(n) , rank() functionality in pig
Date Sun, 20 Feb 2011 00:41:38 GMT

    [ https://issues.apache.org/jira/browse/PIG-821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12996938#comment-12996938
] 

David Ciemiewicz commented on PIG-821:
--------------------------------------

I tried doing an NTILE and generic UDF for computing cummulative statistics such as decile,
rank, denserank, etc for working on query streams.  With a billion or more queries, the serialization
of sequentially processing every single row was a bit slow.

What I realized was that the frequency histogram for something like a "powerlaw" distribution
of a billion queries would be pretty compact (something like only 20K unique frequencies).
 Computing sequential commulative statistics on 20K, 100K, or even 1M rows is far more efficient
than doing it for >1B rows.

So I used map reduce to compute the frequency histogram, compute the deciles on the frequency
histogram, then map them back to the individual queries which then exploits the parallelism
of map-reduce.

With the introduction of Jython scripting for UDFs in Pig 0.8, it makes it pretty simple to
distribute the example.  Furthermore, once inline Jython scripts are implemented (as requested
in another JIRA, this will be even more convenient.

Here's the Pig:
{code}

register 'ntiles.py' using jython as ntiles;

Queries = load 'queries.txt' using PigStorage() as ( query: chararray, freq: long );

FreqGroup = group Queries by ( freq );

FreqHisto = foreach FreqGroup generate
	group as freq,
	COUNT(Queries) as count;

FreqHistoAll = group FreqHisto all;

CummStats = foreach FreqHistoAll {
	FreqHistoOrdered = order FreqHisto by freq desc;
	generate
	FLATTEN(ntiles.freqhisto_cummstats(FreqHistoOrdered, 10));
	};

QueryDeciles = join Queries by freq, CummStats by freq;

QueryDeciles = order QueryDeciles by freq desc, query;

store QueryDeciles into 'query-deciles.txt' using PigStorage();
{code}

Here's the Jython function for computing the cumulative stats on the frequency historgram:
{code}

@outputSchema("freqhisto_cummstats:bag{stats:tuple(freq:long,count:long,total:long,proportion:double,cummcount:long,cummproportion:double,ntile:long,rank:long,denserank:long)}")
def freqhisto_cummstats(freqhisto, ntiles):
  outBag = []
  totalcount = 0
  for freqcount in freqhisto:
    freq = freqcount[0]
    count = freqcount[1]
    totalcount += freq * count
  cummcount = 0
  ntile = 1
  rank = 1
  denserank = 1
  for freqcount in freqhisto:
    freq = freqcount[0]
    count = freqcount[1]
    cummcount += freq * count
    proportion = float(freq) * count / totalcount
    cummproportion = float(cummcount) / totalcount
    if cummproportion >= float(ntile)/ntiles:
      ntile += 1
      if ntile > ntiles:
        ntile = ntiles
    tup=(freq, count, totalcount, proportion, cummcount, cummproportion, ntile, rank, denserank)
    outBag.append(tup)
    rank += 1
    denserank += freq * count
  return outBag
{code}

Here's a sample Perl program to generate "faux" query log data:
{code}

#! /usr/bin/perl

use Digest::MD5 qw(md5 md5_hex md5_base64);
$MAX = 1001;

foreach $i ( 1 .. $MAX ) {
    foreach $j ( 1 .. $i ) {
	$digest = md5_base64($i*$MAX+$j);
	print join("\t", $digest, $MAX-$i), "\n";
    }
}
{code}

> simulate NTILE(n) , rank() functionality in pig
> -----------------------------------------------
>
>                 Key: PIG-821
>                 URL: https://issues.apache.org/jira/browse/PIG-821
>             Project: Pig
>          Issue Type: New Feature
>          Components: impl
>    Affects Versions: 0.2.0
>         Environment: mithril gold -gateway 4000
>            Reporter: Rekha
>
> Hi,
> I came across a job which has some processing which I cant seem to get easily over-the-counter
from pig.
> These are NTILE() /rank() operations available in oracle.
> While I am trying to write a UDF, that is not working out too well for me yet.. :(
> I have a ntile(n) over (partititon by x, y, z order by a desc, b desc) operation to be
done in pig scripts.
> Is there a default function in pig scripting which can do this?
> For example, lets consider a simple example at http://download.oracle.com/docs/cd/B14117_01/server.101/b10759/functions091.htm
> So here, how would we ideally substitute NTILE() with? any pig counterpart function/udf?
> SELECT last_name, salary, NTILE(4) OVER (ORDER BY salary DESC) 
>    AS quartile FROM employees
>    WHERE department_id = 100;
>  
> LAST_NAME                     SALARY   QUARTILE
> ------------------------- ---------- ----------
> Greenberg                      12000          1
> Faviet                          9000          1
> Chen                            8200          2
> Urman                           7800          2
> Sciarra                         7700          3
> Popp                            6900          4
>  
> In real case, i have ntile over multiple columns, so ideal way to find histograms/boundary/spitting
out the bucket number is needed.
> Similarly a pig function is required for rank() over(partition by a,b,c order by d desc)
as e
> Please let me know soon.
> Thanks & Regards,
> /Rekha

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Mime
View raw message