hadoop-pig-dev mailing list archives

From "Shravan Matthur Narayanamurthy (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-285) custom compare functions is ignored
Date Tue, 08 Jul 2008 19:22:33 GMT

    [ https://issues.apache.org/jira/browse/PIG-285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12611759#action_12611759
] 

Shravan Matthur Narayanamurthy commented on PIG-285:
----------------------------------------------------

I have found the issue. Before describing it, let me give some background so that fixing other
related issues is simpler.

The ORDER BY clause is handled with quantiles, so a job that has an ORDER BY occurring in
the main plan is run as multiple jobs:
1. Store the output up to the ORDER BY.
2. Run a quantile job to find the quantiles.
3. Run the sort job.

The following should be the Pig script equivalent of getQuantileJob in MRCompiler:
{noformat}
A = load fSpec using RandomSampleLoader;
B = foreach A generate flatten(col1), flatten(col2), ...;
C = group B all;
D = foreach C {
	D1 = order $1 by *;
	generate requestedParallelism, D1;
}
E = foreach D generate FindQuantiles(*);
store E into quantFiles;
{noformat}
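
To make the intent of the quantile job concrete, here is a minimal Java sketch of "sort the sample, then pick evenly spaced boundary keys". It is not the FindQuantiles implementation: the class name QuantileSketch, the method findQuantiles and the index rule i * n / parallelism are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for the quantile job; not Pig's FindQuantiles.
class QuantileSketch {
    /**
     * Sorts a copy of the sample with the given comparator, then picks
     * (parallelism - 1) evenly spaced boundary keys from the sorted copy.
     */
    static <K> List<K> findQuantiles(List<K> sample, int parallelism,
                                     Comparator<? super K> cmp) {
        List<K> boundaries = new ArrayList<>();
        if (sample.isEmpty() || parallelism <= 1) {
            return boundaries; // a single partition needs no boundaries
        }
        List<K> sorted = new ArrayList<>(sample);
        sorted.sort(cmp); // corresponds to the nested "order $1 by *"
        for (int i = 1; i < parallelism; i++) {
            // Assumed selection rule: the element at i/parallelism of the way in.
            boundaries.add(sorted.get(i * sorted.size() / parallelism));
        }
        return boundaries;
    }

    public static void main(String[] args) {
        List<Integer> sample = Arrays.asList(7, 1, 9, 3, 5, 8, 2, 6);
        // prints [3, 6, 8]
        System.out.println(findQuantiles(sample, 4, Comparator.<Integer>naturalOrder()));
        // prints [7, 5, 2]
        System.out.println(findQuantiles(sample, 4, Comparator.<Integer>reverseOrder()));
    }
}
```

Note that the boundaries are only meaningful relative to the comparator that sorted the sample, which is why the same comparator has to be used later by the sort job.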

The getSortJob should look something like this:
{noformat}
A = load fSpec using BinStorage;
B = group A by (col1,col2,...);
C = foreach B generate flatten(A);
{noformat}

C should then hold the output of the ORDER BY.

Also, the sort job needs a few key things turned on in Hadoop for it to work:
1. Use the SortPartitioner as the key partitioner; it internally uses the quantile file
generated by the quantile job.
2. Supply any user-defined comparator to Hadoop as the output key comparator.
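
As a rough illustration of why both settings matter together, the sketch below routes a key to a reducer by binary search over the quantile boundaries. It is not the SortPartitioner code: RangePartitionSketch and partition are hypothetical names, and the sketch assumes the boundaries are distinct and were sorted with the same comparator that is passed in.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for a quantile-driven partitioner; not Pig's SortPartitioner.
class RangePartitionSketch {
    /**
     * Returns the reducer index for a key. The boundaries must already be
     * ordered by cmp, otherwise the binary search result is meaningless.
     */
    static <K> int partition(K key, List<K> boundaries, Comparator<? super K> cmp) {
        int pos = Collections.binarySearch(boundaries, key, cmp);
        // Found: a key equal to a boundary goes to that boundary's partition.
        // Not found: binarySearch returns -(insertionPoint) - 1.
        return pos >= 0 ? pos : -(pos + 1);
    }

    public static void main(String[] args) {
        Comparator<Integer> asc = Comparator.naturalOrder();
        Comparator<Integer> desc = Comparator.reverseOrder();

        // Ascending sort with boundaries 10 and 20 gives three partitions.
        System.out.println(partition(5, Arrays.asList(10, 20), asc));   // prints 0
        System.out.println(partition(15, Arrays.asList(10, 20), asc));  // prints 1
        System.out.println(partition(25, Arrays.asList(10, 20), asc));  // prints 2

        // Descending sort: boundaries and comparator are both reversed,
        // so the largest keys land in partition 0.
        System.out.println(partition(25, Arrays.asList(20, 10), desc)); // prints 0
        System.out.println(partition(5, Arrays.asList(20, 10), desc));  // prints 2
    }
}
```

If the partitioner orders keys one way and the output key comparator another, each reducer's output may be sorted internally while the concatenation of reducer outputs is not.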

That is the ideal behaviour. The issue was the following: since the physical plan of the
quantile job was hand-crafted, it contained the plan for the following script instead of
what it should have been:
{noformat}
A = load fSpec using RandomSampleLoader;
B = foreach A generate flatten(col1), flatten(col2), ...;
C = group B all;
D = foreach C {
	generate requestedParallelism, $1;
}
E = foreach D generate FindQuantiles(*);
store E into quantFiles;
{noformat}

Hence, instead of the sorted output, all we saw was the grouped output, and probably some
incorrect results as well: the quantiles might have been messed up if a parallel statement was
used along with the order by, which is the cause of [PIG-292|https://issues.apache.org/jira/browse/PIG-292].
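
One way the missing nested order by could corrupt the quantiles is sketched below. It assumes, purely for illustration, that the quantile step takes evenly spaced rows from whatever bag it is handed; that selection rule and the names UnsortedQuantileSketch, pickEvenlySpaced and isNonDecreasing are not taken from the Pig source.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration only; the selection rule is an assumption.
class UnsortedQuantileSketch {
    // Takes (parallelism - 1) evenly spaced rows from the input as-is.
    static List<Integer> pickEvenlySpaced(List<Integer> rows, int parallelism) {
        List<Integer> picks = new ArrayList<>();
        for (int i = 1; i < parallelism; i++) {
            picks.add(rows.get(i * rows.size() / parallelism));
        }
        return picks;
    }

    // Range partitioning needs boundaries that never decrease.
    static boolean isNonDecreasing(List<Integer> xs) {
        for (int i = 1; i < xs.size(); i++) {
            if (xs.get(i - 1) > xs.get(i)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Grouped but unsorted sample, as produced without the nested order by.
        List<Integer> grouped = Arrays.asList(7, 1, 9, 3, 5, 8, 2, 6);
        List<Integer> ordered = new ArrayList<>(grouped);
        Collections.sort(ordered);

        System.out.println(pickEvenlySpaced(grouped, 4)); // prints [9, 5, 2]
        System.out.println(pickEvenlySpaced(ordered, 4)); // prints [3, 6, 8]
        System.out.println(isNonDecreasing(pickEvenlySpaced(grouped, 4))); // prints false
    }
}
```

With a single reducer there are no boundaries to pick, which would be consistent with the problem only showing up when a parallel statement is used.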

The other part was that the user-defined comparator was not being passed as the output key
comparator, which is the cause of the current bug. Another thing that kept us from finding
the bug early was an error in the testSort test case, which I have corrected in [PIG-295|https://issues.apache.org/jira/browse/PIG-295].

To resolve this issue, I first corrected the quantile job to include the order by in the nested
plan. However, this caused problems with deserializing POUserComparisonFunc, which extended
POUserFunc. When a POUserComparisonFunc was deserialized, the POUserFunc part got deserialized
first and tried to instantiate an EvalFunc from a ComparisonFunc spec. To resolve this, I had
to make POUserComparisonFunc independent of POUserFunc, and here I have made the assumption
that ComparisonFunc is used only in ORDER BY and not elsewhere. This accounts for all the
seemingly extraneous changes in the patch.
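
The deserialization problem can be reproduced in miniature with plain Java serialization, where the superclass part of an object is always restored before the subclass part. The sketch below is a simplified model, not the Pig classes: EvalLike, ComparisonLike, UserFuncOp, CoupledComparisonOp and IndependentComparisonOp are hypothetical stand-ins for EvalFunc, ComparisonFunc, POUserFunc and the old and new POUserComparisonFunc.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// All names here are hypothetical stand-ins, not the actual Pig classes.
class SerializationOrderSketch {
    static Object instantiate(String className) {
        try {
            return Class.forName(className).getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Serializes and deserializes obj; false if the cast in readObject fails.
    static boolean survivesRoundTrip(Serializable obj) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(obj);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                in.readObject();
            }
            return true;
        } catch (ClassCastException e) {
            return false;
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String spec = DescCompare.class.getName();
        System.out.println(survivesRoundTrip(new CoupledComparisonOp(spec)));     // prints false
        System.out.println(survivesRoundTrip(new IndependentComparisonOp(spec))); // prints true
    }
}

interface EvalLike {}
interface ComparisonLike {}
class DescCompare implements ComparisonLike {}

class UserFuncOp implements Serializable {
    private static final long serialVersionUID = 1L;
    protected String funcSpec;
    protected transient EvalLike func;

    UserFuncOp(String funcSpec) { this.funcSpec = funcSpec; }

    private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
        in.defaultReadObject();
        // Runs before any subclass state is restored; fails for a comparison spec.
        func = (EvalLike) SerializationOrderSketch.instantiate(funcSpec);
    }
}

// Old shape: inherits the superclass readObject and therefore its cast.
class CoupledComparisonOp extends UserFuncOp {
    private static final long serialVersionUID = 1L;
    CoupledComparisonOp(String funcSpec) { super(funcSpec); }
}

// New shape: no shared superclass, so only the comparison type is instantiated.
class IndependentComparisonOp implements Serializable {
    private static final long serialVersionUID = 1L;
    private String funcSpec;
    private transient ComparisonLike func;

    IndependentComparisonOp(String funcSpec) { this.funcSpec = funcSpec; }

    private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
        in.defaultReadObject();
        func = (ComparisonLike) SerializationOrderSketch.instantiate(funcSpec);
    }
}
```

Decoupling the two operators avoids the failure without having to teach the superclass about comparison specs, at the cost of some duplicated code.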

The next thing I did was to try to supply the missing user-defined comparator to Hadoop as
the key comparator. However, this causes a problem: we assume that ComparisonFunc always
compares Tuples, but with the inclusion of types we no longer wrap everything into a tuple
and instead use the basic types wherever possible. The patch I am going to submit does not
address this part. It assumes that the issue with ComparisonFunc will be fixed separately and
directly sets the user-defined comparator as the output key comparator. For the time being,
this will cause all user-defined comparisons to fail.

Some hints on the ComparisonFunc issue:
1. The solution should take into consideration that a ComparisonFunc is sometimes generic and
need not know the schema of the input, e.g. OrdDesc.
2. Often, however, if it is not a generic ComparisonFunc, we can assume that the schema is
known.
3. The ComparisonFunc will have to work with Hadoop types and not Pig types, as it will be
used at the boundary between LR & Pkg.

Currently, ComparisonFunc extends WritableComparator and gives a concrete implementation that
delegates all compare(WritableComparable,WritableComparable) calls to compare(Tuple,Tuple).
If we instead leave compare(WritableComparable,WritableComparable) abstract, I feel it should
solve the problem: users can then provide an implementation of compare for the type that they
are expecting. I will attach a patch shortly.
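
The difference between the two designs can be sketched without any Hadoop dependency. In the model below, Object stands in for WritableComparable and List<?> stands in for Tuple. All class names (KeyComparisonSketch, TupleOnlyComparison, RawKeyComparison, Descending, FirstField) are hypothetical, and Descending only imitates the spirit of a generic comparator such as OrdDesc.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Simplified model: Object plays WritableComparable, List<?> plays Tuple.
class KeyComparisonSketch {
    /** Current shape: the raw-key entry point always delegates to tuples. */
    abstract static class TupleOnlyComparison implements Comparator<Object> {
        @Override
        public int compare(Object a, Object b) {
            // Throws ClassCastException when keys arrive as bare values.
            return compareTuples((List<?>) a, (List<?>) b);
        }
        abstract int compareTuples(List<?> a, List<?> b);
    }

    /** Proposed shape: the raw-key compare is abstract; the user picks the type. */
    abstract static class RawKeyComparison implements Comparator<Object> {
        @Override
        public abstract int compare(Object a, Object b);
    }

    /** Generic descending order that needs no schema. */
    static class Descending extends RawKeyComparison {
        @Override
        @SuppressWarnings("unchecked")
        public int compare(Object a, Object b) {
            return ((Comparable<Object>) b).compareTo(a);
        }
    }

    /** A tuple-bound comparison on the first field, in the current shape. */
    static class FirstField extends TupleOnlyComparison {
        @Override
        @SuppressWarnings("unchecked")
        int compareTuples(List<?> a, List<?> b) {
            return ((Comparable<Object>) a.get(0)).compareTo(b.get(0));
        }
    }

    static List<Object> sortedWith(List<?> keys, Comparator<Object> cmp) {
        List<Object> copy = new ArrayList<>(keys);
        copy.sort(cmp);
        return copy;
    }

    static boolean failsOnBareKeys(Comparator<Object> cmp) {
        try {
            cmp.compare("alice allen", "zach zipper");
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("alice allen", "zach zipper", "bob brown");
        // prints [zach zipper, bob brown, alice allen]
        System.out.println(sortedWith(names, new Descending()));
        System.out.println(failsOnBareKeys(new FirstField())); // prints true
        System.out.println(failsOnBareKeys(new Descending())); // prints false
    }
}
```

In this model a schema-aware comparison can still cast to the tuple type itself, while a generic one works on whatever key type arrives.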

> custom compare functions is ignored
> -----------------------------------
>
>                 Key: PIG-285
>                 URL: https://issues.apache.org/jira/browse/PIG-285
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: types_branch
>            Reporter: Olga Natkovich
>            Assignee: Shravan Matthur Narayanamurthy
>
> The following query successfully runs but the results don't come in the correct order:
> a = load 'studenttab10k';
> c = order a by $0 using org.apache.pig.test.udf.orderby.OrdDesc;
> store c into 'out';
> results:
> alice allen     27      1.950
> alice allen     42      2.460
> alice allen     38      0.810
> alice allen     68      3.390
> alice allen     77      2.520
> alice allen     36      2.270
> .....
> expected:
> zach zipper     66      2.670
> zach zipper     47      2.920
> zach zipper     19      1.910
> zach zipper     23      1.120
> zach zipper     40      2.030
> zach zipper     59      2.530
> .....

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

