hadoop-pig-dev mailing list archives

From "Sriranjan Manjunath (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-1143) Poisson Sample Loader should compute the number of samples required only once
Date Thu, 10 Dec 2009 22:26:18 GMT

    [ https://issues.apache.org/jira/browse/PIG-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12788971#action_12788971 ]

Sriranjan Manjunath commented on PIG-1143:

To describe the problem in more detail: the current implementation does not handle globs
efficiently. When the sample loader encounters a directory (or a combination of directories),
it fetches the element descriptors of every file inside it in order to compute the file sizes.
For example, A = load "{view, click}" results in computing the file sizes of all the files
underneath both the "view" and "click" directories. With a large number of mappers, this
produces a very large number of HDFS calls and clogs the NameNode.
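
To give a rough sense of scale, here is a back-of-the-envelope sketch in Java. It assumes
each mapper issues one size lookup per file under the glob. The class name, method names
and the mapper and file counts are all hypothetical and chosen only for illustration; they
are not measurements from a real cluster.

```java
// Illustrative cost model only; not part of Pig.
final class NameNodeLoad {
    private NameNodeLoad() {}

    // Assumption: every mapper looks up the size of every file under the glob.
    static long lookupsWhenEveryMapperComputes(int mappers, int filesUnderGlob) {
        return (long) mappers * filesUnderGlob;
    }

    // Assumption: the sample count is computed once, so each file is looked up once.
    static long lookupsWhenComputedOnce(int filesUnderGlob) {
        return filesUnderGlob;
    }

    public static void main(String[] args) {
        int mappers = 1000; // hypothetical
        int files = 5000;   // hypothetical
        System.out.println("every mapper: " + lookupsWhenEveryMapperComputes(mappers, files));
        System.out.println("computed once: " + lookupsWhenComputedOnce(files));
    }
}
```

With these made-up numbers, the lookup count drops from 5,000,000 to 5,000.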

I intend to modify the Poisson Sample Loader as follows. The algorithm for computing the total
number of samples remains the same, but it will no longer be run by every mapper. I will use
the UDFContext object to share this information across mappers. Since mappers and reducers
can only read information from UDFContext, the slicer will be the one to store it. The slicer
will compute the sample count for the first map: as before, PigSlice will call computeSamples()
for the first map, and the value will then be stored as a property in the UDFContext object.
The slicer will check UDFContext to see whether this value is already set and, if it is, will
use it instead of computing it again. I intend to use "pig.input.0.sampleCount" as the key.
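
The check-then-store step described above could look roughly like the following sketch. It
is not the actual patch: a plain java.util.Properties object stands in for the properties
held by UDFContext, and the class SampleCountCache, the method getOrCompute and the
LongSupplier standing in for computeSamples() are hypothetical names introduced only for
this example. Only the key "pig.input.0.sampleCount" comes from the proposal itself.

```java
import java.util.Properties;
import java.util.function.LongSupplier;

// Hypothetical stand-in for the proposed caching; not Pig's actual code.
final class SampleCountCache {
    // Key named in the proposal above.
    static final String SAMPLE_COUNT_KEY = "pig.input.0.sampleCount";

    private SampleCountCache() {}

    // Returns the stored sample count if the key is set; otherwise runs the
    // expensive computation once and stores the result under the key.
    static long getOrCompute(Properties props, LongSupplier computeSamples) {
        String cached = props.getProperty(SAMPLE_COUNT_KEY);
        if (cached != null) {
            return Long.parseLong(cached);
        }
        long count = computeSamples.getAsLong();
        props.setProperty(SAMPLE_COUNT_KEY, Long.toString(count));
        return count;
    }

    public static void main(String[] args) {
        Properties props = new Properties(); // stands in for the UDFContext properties
        int[] invocations = {0};
        // Made-up computation that records how often it is invoked.
        LongSupplier expensive = () -> { invocations[0]++; return 1234L; };
        for (int slice = 0; slice < 5; slice++) {
            getOrCompute(props, expensive);
        }
        System.out.println("computeSamples() ran " + invocations[0] + " time(s)");
    }
}
```

In this sketch the first call pays for the computation and the remaining four read the
stored value, so the expensive step runs once regardless of how many slices ask for it.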

This solution will keep fileSize() invocations to a minimum and should reduce the load
on the NameNode.

> Poisson Sample Loader should compute the number of samples required only once
> -----------------------------------------------------------------------------
>                 Key: PIG-1143
>                 URL: https://issues.apache.org/jira/browse/PIG-1143
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Sriranjan Manjunath
>            Assignee: Sriranjan Manjunath
> The current Poisson sampler forces each of the maps to compute the sample number. This
> is redundant and causes issues when a large directory is specified in the join. The sampler
> should be changed to calculate the sample count only once and this information should be shared
> with the remaining mappers.

