hadoop-pig-dev mailing list archives

From "Richard Ding (JIRA)" <j...@apache.org>
Subject [jira] Updated: (PIG-1169) Top-N queries produce incorrect results when a store statement is added between order by and limit statement
Date Thu, 24 Dec 2009 18:06:29 GMT

     [ https://issues.apache.org/jira/browse/PIG-1169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Richard Ding updated PIG-1169:
------------------------------

    Description: 

??We tried to get top N results after a groupby and sort, and got different results with or
without storing the full sorted results. Here is a skeleton of our pig script.??

{code}
raw_data = Load '<input_files>' AS (f1, f2, ..., fn);
grouped = group raw_data by (f1, f2);
data = foreach grouped generate FLATTEN(group), SUM(raw_data.fk) as value;
ordered = order data by value DESC parallel 10;
topn = limit ordered 10;
store ordered into 'outputdir/full';
store topn into 'outputdir/topn';
{code}

??With the statement 'store ordered ...', top N results are incorrect, but without the statement,
results are correct. Has anyone seen this before? I know a similar bug has been fixed in the
multi-query release. We are on Pig 0.4 and Hadoop 0.20.1.??



  was:
Recently, a couple of problems related to the Top N queries were reported by users.

* From Chuang Liu:

We tried to get top N results after a groupby and sort, and got different results with or
without storing the full sorted results. Here is a skeleton of our pig script.

{code}
raw_data = Load '<input_files>' AS (f1, f2, ..., fn);
grouped = group raw_data by (f1, f2);
data = foreach grouped generate FLATTEN(group), SUM(raw_data.fk) as value;
ordered = order data by value DESC parallel 10;
topn = limit ordered 10;
store ordered into 'outputdir/full';
store topn into 'outputdir/topn';
{code}

With the statement 'store ordered ...', top N results are incorrect, but without the statement,
results are correct. Has anyone seen this before? I know a similar bug has been fixed in the
multi-query release. We are on Pig 0.4 and Hadoop 0.20.1.

* From Corry Haines:

I am not sure if this is a bug, or something more subtle, but here is the problem that I am
having.

When I LOAD a dataset, change it with an ORDER, LIMIT it, then CROSS it with itself, the results
are not correct. I expect to see the cross of the limited, ordered dataset, but instead I
see the cross of the limited dataset. Effectively, it's like the LIMIT is being excluded.

Pig Version: 0.5.0
Hadoop Version: 0.20.1

I would greatly appreciate some help, as this is somewhat frustrating.

Example code (and output) follows:

{code}
A = load 'foo' as (f1:int, f2:int, f3:int);
B = load 'foo' as (f1:int, f2:int, f3:int);
a = ORDER A BY f1 DESC;
b = ORDER B BY f1 DESC;
aa = LIMIT a 1;
bb = LIMIT b 1;
C = CROSS aa, bb;
DUMP C;
{code}


        Summary: Top-N queries produce incorrect results when a store statement is added between order by and limit statement  (was: Problems with some top N queries)

> Top-N queries produce incorrect results when a store statement is added between order by and limit statement
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: PIG-1169
>                 URL: https://issues.apache.org/jira/browse/PIG-1169
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: 0.7.0
>            Reporter: Richard Ding
>            Assignee: Richard Ding
>
> ??We tried to get top N results after a groupby and sort, and got different results with
> or without storing the full sorted results. Here is a skeleton of our pig script.??
> {code}
> raw_data = Load '<input_files>' AS (f1, f2, ..., fn);
> grouped = group raw_data by (f1, f2);
> data = foreach grouped generate FLATTEN(group), SUM(raw_data.fk) as value;
> ordered = order data by value DESC parallel 10;
> topn = limit ordered 10;
> store ordered into 'outputdir/full';
> store topn into 'outputdir/topn';
> {code}
> ??With the statement 'store ordered ...', top N results are incorrect, but without the
> statement, results are correct. Has anyone seen this before? I know a similar bug has been
> fixed in the multi-query release. We are on Pig 0.4 and Hadoop 0.20.1.??
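
For anyone checking their own output against this report, the expected behavior can be stated as an invariant: the relation stored in 'outputdir/topn' should equal the first N rows of the relation stored in 'outputdir/full'. The Python sketch below is only a reference model of the skeleton script's intended semantics, not Pig code and not the fix. The function name top_n_reference and the sample rows are hypothetical, and the model assumes each input row has been reduced to (f1, f2, fk).

```python
from collections import defaultdict


def top_n_reference(rows, n):
    """Reference model: group by (f1, f2), SUM(fk), order DESC, limit n.

    Hypothetical helper for illustration only; it mirrors the skeleton
    script's intent, not Pig's execution plan.
    """
    sums = defaultdict(int)
    for f1, f2, fk in rows:
        sums[(f1, f2)] += fk

    # 'ordered' corresponds to the relation stored in 'outputdir/full'.
    ordered = sorted(
        ((f1, f2, value) for (f1, f2), value in sums.items()),
        key=lambda row: row[2],
        reverse=True,
    )

    # 'topn' corresponds to the relation stored in 'outputdir/topn'.
    topn = ordered[:n]
    return ordered, topn


# Made-up sample rows (f1, f2, fk), purely for illustration.
rows = [("a", "x", 5), ("a", "x", 7), ("b", "y", 3), ("c", "z", 20), ("b", "y", 1)]
ordered, topn = top_n_reference(rows, 2)

# The invariant the bug violates: topn must be a prefix of ordered,
# whether or not 'ordered' is also stored.
assert topn == ordered[:2]
```

Comparing the two stored outputs this way (first N rows of the full output versus the top-N output) is one simple way to confirm whether a given run is affected.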

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

