hadoop-pig-dev mailing list archives

From "Hadoop QA (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-733) Order by sampling dumps entire sample to hdfs which causes dfs "FileSystem closed" error on large input
Date Tue, 07 Apr 2009 09:05:12 GMT

    [ https://issues.apache.org/jira/browse/PIG-733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12696442#action_12696442

Hadoop QA commented on PIG-733:

-1 overall.  Here are the results of testing the latest attachment 
  against trunk revision 759376.

    +1 @author.  The patch does not contain any @author tags.

    -1 tests included.  The patch doesn't appear to include any new or modified tests.
                        Please justify why no tests are needed for this patch.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    -1 findbugs.  The patch appears to introduce 5 new Findbugs warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    -1 core tests.  The patch failed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/21/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/21/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/21/console

This message is automatically generated.

> Order by sampling dumps entire sample to hdfs which causes dfs "FileSystem closed" error on large input
> -------------------------------------------------------------------------------------------------------
>                 Key: PIG-733
>                 URL: https://issues.apache.org/jira/browse/PIG-733
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.2.0
>            Reporter: Pradeep Kamath
>            Assignee: Pradeep Kamath
>             Fix For: 0.3.0
>         Attachments: PIG-733-v2.patch, PIG-733.patch
> Order by has a sampling job which samples the input and creates a sorted list of sample
> items. Currently the number of items sampled is 100 per map task, so if the input is large
> enough to produce many maps (say 50,000), the sample is big. This sorted sample is stored on
> dfs. The WeightedRangePartitioner computes quantile boundaries and weighted probabilities for
> repeated values in each map by reading the sample file from dfs. In queries with many maps (on
> the order of 50,000), the dfs read of the sample file fails with a "FileSystem closed" error.
> This seems to point to a dfs issue wherein a big dfs file being read simultaneously by many
> dfs clients (in this case, all the maps) causes the clients to be closed. Regardless, on the
> pig side, loading the sample in each map of the final map reduce job and computing the quantile
> boundaries and weighted probabilities there is inefficient. We should do this computation
> through a FindQuantiles udf in the same map reduce job that produces the sorted samples. That
> way less data is written to dfs, and in the final map reduce job the WeightedRangePartitioner
> needs only to load the precomputed information.
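The boundary computation the description moves into the sampling job can be sketched as follows. This is a minimal, hypothetical illustration (class and method names are invented for this sketch, not Pig's actual FindQuantiles UDF): given the sorted sample and the number of reducers R, it picks the R-1 keys that split the sample into R roughly equal ranges.

```java
import java.util.Arrays;

// Hypothetical sketch of quantile-boundary selection; not Pig's
// real FindQuantiles implementation.
public class FindQuantilesSketch {

    // Given a sorted sample, return the R-1 boundary keys that split
    // the key space into R roughly equal-sized partitions.
    static int[] quantileBoundaries(int[] sortedSample, int numReducers) {
        int[] boundaries = new int[numReducers - 1];
        for (int i = 1; i < numReducers; i++) {
            // Index of the i-th quantile within the sorted sample.
            int idx = (int) ((long) i * sortedSample.length / numReducers);
            boundaries[i - 1] = sortedSample[idx];
        }
        return boundaries;
    }

    public static void main(String[] args) {
        // A sorted sample of 100 keys: 0, 1, ..., 99.
        int[] sample = new int[100];
        for (int i = 0; i < sample.length; i++) sample[i] = i;
        // With 4 reducers we get 3 boundaries, one per internal quantile.
        System.out.println(Arrays.toString(quantileBoundaries(sample, 4)));
    }
}
```

In the proposed scheme only this small boundary array (plus the weighted probabilities for repeated keys) would be written to dfs, instead of the full concatenated sample, so each map in the final job reads a few bytes rather than one large shared file.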

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
