hadoop-pig-dev mailing list archives

From "Pradeep Kamath (JIRA)" <j...@apache.org>
Subject [jira] Created: (PIG-733) Order by sampling dumps entire sample to hdfs which causes dfs "FileSystem closed" error on large input
Date Wed, 25 Mar 2009 19:05:58 GMT
Order by sampling dumps entire sample to hdfs which causes dfs "FileSystem closed" error on
large input
-------------------------------------------------------------------------------------------------------

                 Key: PIG-733
                 URL: https://issues.apache.org/jira/browse/PIG-733
             Project: Pig
          Issue Type: Bug
            Reporter: Pradeep Kamath
            Assignee: Pradeep Kamath


Order by runs a sampling job that samples the input and creates a sorted list of sample items.
Currently 100 items are sampled per map task, so a large input that results in many maps
(say 50,000) produces a large sample (about 5 million items). This sorted sample is stored on DFS.
In each map of the final map reduce job, the WeightedRangePartitioner reads the sample file from
DFS and computes quantile boundaries and weighted probabilities for repeating values. In queries
with many maps (on the order of 50,000), the DFS read of the sample file fails with a
"FileSystem closed" error. This seems to point to a DFS issue in which a big DFS file being read
simultaneously by many DFS clients (in this case, all maps) causes the clients to be closed.

Independently of that DFS issue, on the Pig side it is inefficient to load the sample in each map
of the final map reduce job and recompute the quantile boundaries and weighted probabilities
there. We should do this computation once, through a FindQuantiles UDF, in the same map reduce job
that produces the sorted samples. This way less data is written to DFS, and in the final map
reduce job the WeightedRangePartitioner only needs to load the precomputed information.
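
To make the proposal concrete, here is a minimal Python sketch of the kind of computation a
FindQuantiles step could perform once over the sorted sample. It is an illustration only, not
Pig's implementation: the function name find_quantiles, its parameters, and the shape of the
returned data are assumptions. It cuts the sorted sample into equal-sized ranges, one per
partition, and for any value that repeats across more than one range it records the fraction of
its occurrences that falls in each range.

```python
from collections import defaultdict


def find_quantiles(sorted_sample, num_partitions):
    """Hypothetical sketch: derive partitioning info from a sorted sample.

    Returns (boundaries, weights):
      boundaries: num_partitions - 1 cut values taken from the sample
      weights:    {value: {partition: probability}} only for values whose
                  occurrences span more than one partition
    """
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    n = len(sorted_sample)
    if n == 0:
        return [], {}

    # Cut points: the first sample item of every partition except partition 0.
    boundaries = [sorted_sample[(p * n) // num_partitions]
                  for p in range(1, num_partitions)]

    # Count how many occurrences of each value land in each partition's range.
    counts = defaultdict(lambda: defaultdict(int))
    for p in range(num_partitions):
        start = (p * n) // num_partitions
        end = ((p + 1) * n) // num_partitions
        for i in range(start, end):
            counts[sorted_sample[i]][p] += 1

    # Keep weights only for values that straddle several partitions.
    weights = {}
    for value, per_partition in counts.items():
        if len(per_partition) > 1:
            total = sum(per_partition.values())
            weights[value] = {p: c / total for p, c in per_partition.items()}
    return boundaries, weights


if __name__ == "__main__":
    boundaries, weights = find_quantiles([1, 2, 2, 2, 2, 2, 2, 3], 4)
    print(boundaries)
    print(weights)
```

With a result of this size (a handful of boundaries plus weights for the heavily repeated
values), each map in the final job would load a small precomputed structure instead of reading
and processing the full sample file.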


