hadoop-pig-dev mailing list archives

From "David Ciemiewicz (JIRA)" <j...@apache.org>
Subject [jira] Created: (PIG-826) DISTINCT as "Function" rather than statement - High Level Pig
Date Mon, 01 Jun 2009 04:56:07 GMT
DISTINCT as "Function" rather than statement - High Level Pig
-------------------------------------------------------------

                 Key: PIG-826
                 URL: https://issues.apache.org/jira/browse/PIG-826
             Project: Pig
          Issue Type: New Feature
            Reporter: David Ciemiewicz


In SQL, a user would think nothing of doing something like:

{code}
select
    COUNT(DISTINCT(user)) as user_count,
    COUNT(DISTINCT(country)) as country_count,
    COUNT(DISTINCT(url)) as url_count
from
    server_logs;
{code}

But in Pig, we'd need to do something like the following.  And this is about the most
compact version I could come up with.

{code}
Logs = load 'log' using PigStorage()
        as ( user: chararray, country: chararray, url: chararray);

DistinctUsers = distinct (foreach Logs generate user);
DistinctCountries = distinct (foreach Logs generate country);
DistinctUrls = distinct (foreach Logs generate url);

DistinctUsersCount = foreach (group DistinctUsers all) generate
        group, COUNT(DistinctUsers) as user_count;
DistinctCountriesCount = foreach (group DistinctCountries all) generate
        group, COUNT(DistinctCountries) as country_count;
DistinctUrlCount = foreach (group DistinctUrls all) generate
        group, COUNT(DistinctUrls) as url_count;

AllDistinctCounts = cross
        DistinctUsersCount, DistinctCountriesCount, DistinctUrlCount;

Report = foreach AllDistinctCounts generate
        DistinctUsersCount::user_count,
        DistinctCountriesCount::country_count,
        DistinctUrlCount::url_count;

store Report into 'log_report' using PigStorage();
{code}

It would be good if there were a higher-level version of Pig that permitted code to be written
as:

{code}
Logs = load 'log' using PigStorage()
        as ( user: chararray, country: chararray, url: chararray);

Report = overall Logs generate
        COUNT(DISTINCT(user)) as user_count,
        COUNT(DISTINCT(country)) as country_count,
        COUNT(DISTINCT(url)) as url_count;

store Report into 'log_report' using PigStorage();
{code}

I do want this in Pig and not as SQL.  I'd expect High Level Pig to generate Lower Level Pig.
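
As a rough sketch only, the lower-level Pig generated for the "overall" statement might look
something like the following.  This assumes that DISTINCT is accepted inside a nested FOREACH
block in the Pig version at hand; the aliases Grouped, Users, Countries and Urls are made up
for illustration and are not part of the proposal.

{code}
Logs = load 'log' using PigStorage()
        as ( user: chararray, country: chararray, url: chararray);

-- Collect every record into a single group so the counts are overall totals.
Grouped = group Logs all;

Report = foreach Grouped {
        -- Project each column out of the bag, then de-duplicate it.
        Users = Logs.user;
        DistinctUsers = distinct Users;
        Countries = Logs.country;
        DistinctCountries = distinct Countries;
        Urls = Logs.url;
        DistinctUrls = distinct Urls;
        generate
                COUNT(DistinctUsers) as user_count,
                COUNT(DistinctCountries) as country_count,
                COUNT(DistinctUrls) as url_count;
};

store Report into 'log_report' using PigStorage();
{code}

Because "group ... all" puts every record into one group, this form may concentrate the work
on a single reducer, so it may not scale as well as the longer distinct/group/cross version
for large inputs.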

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

