hadoop-pig-dev mailing list archives

From "Sriranjan Manjunath (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-1143) Poisson Sample Loader should compute the number of samples required only once
Date Thu, 10 Dec 2009 23:32:18 GMT

    [ https://issues.apache.org/jira/browse/PIG-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789023#action_12789023 ]

Sriranjan Manjunath commented on PIG-1143:

The file size in the documentation refers to the size on disk. To account for compression,
encoding, etc., a configurable parameter, pig.inputfile.conversionfactor, is provided. I agree
that it cannot be set to a good value for compressed data; it is only guidance. The implications
of setting it to a bad value are minimal: we will end up sampling a little more than the required
number of samples (unless you set it to a fraction, in which case we may sample fewer).
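
To illustrate the effect of the conversion factor, here is a minimal sketch. It is not Pig's
actual formula, which this thread does not spell out. It assumes the in-memory size is estimated
as the on-disk size times the factor, and that one sample is taken per fixed number of estimated
bytes. The function names and the bytes_per_sample parameter are hypothetical.

```python
import math


def estimated_size_bytes(disk_bytes, conversion_factor):
    """Scale the on-disk size to a rough in-memory size (assumed model)."""
    return disk_bytes * conversion_factor


def sample_count(disk_bytes, conversion_factor, bytes_per_sample):
    """Hypothetical rule: one sample per bytes_per_sample of estimated data."""
    estimated = estimated_size_bytes(disk_bytes, conversion_factor)
    return math.ceil(estimated / bytes_per_sample)
```

Under these assumptions, a factor that is too large only inflates the sample count, while a
fractional factor shrinks it below what a factor of 1 would give.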

> Poisson Sample Loader should compute the number of samples required only once
> -----------------------------------------------------------------------------
>                 Key: PIG-1143
>                 URL: https://issues.apache.org/jira/browse/PIG-1143
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Sriranjan Manjunath
>            Assignee: Sriranjan Manjunath
> The current Poisson sampler forces each of the maps to compute the sample number. This
> is redundant and causes issues when a large directory is specified in the join. The sampler
> should be changed to calculate the sample count only once and this information should be shared
> with the remaining mappers.
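
The "compute once, then share" idea in the issue description can be sketched as follows. This is
an illustration only, not the Pig patch: the job configuration is modeled as a plain dictionary,
and the key "sample.count", the function names, and the bytes_per_sample parameter are all
hypothetical.

```python
def compute_sample_count(input_sizes, bytes_per_sample):
    """Stand-in for the expensive step that walks every input file."""
    compute_sample_count.calls += 1  # track how often the scan runs
    total_bytes = sum(input_sizes)
    return max(1, total_bytes // bytes_per_sample)


compute_sample_count.calls = 0


def mapper(job_conf):
    """Each mapper reads the shared value instead of recomputing it."""
    return job_conf["sample.count"]


def run_job(input_sizes, num_mappers, bytes_per_sample):
    """Compute the count once up front, then hand it to every mapper."""
    job_conf = {"sample.count": compute_sample_count(input_sizes, bytes_per_sample)}
    return [mapper(job_conf) for _ in range(num_mappers)]
```

With this structure the scan of the input directory runs once per job regardless of the number
of mappers, which is the behavior the issue asks for.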

