hadoop-pig-dev mailing list archives

From "Thejas M Nair (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-1143) Poisson Sample Loader should compute the number of samples required only once
Date Thu, 10 Dec 2009 23:22:18 GMT

    [ https://issues.apache.org/jira/browse/PIG-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789013#action_12789013 ]

Thejas M Nair commented on PIG-1143:

The PoissonSampleLoader implementation in the load-store redesign does not check the file size
and takes a different approach, for the following reason (as mentioned in PIG-1062):

With the new interfaces in the load-store redesign, Pig can compute the file size by adding up
the size of each split (from InputSplit.getLength()). But the documentation of that function does
not make it clear whether this is the size on disk, compressed or uncompressed, etc. It looks like
it just needs to be some number proportional to the size of the file. Assuming it is the size on
disk (uncompressed), using it to estimate the total memory required is tricky: one has to make
assumptions about the compression ratio and the serialization method.
Using Tuple.getMemorySize() while sampling will give more accurate numbers for the memory
the reducer will consume.
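
To make the contrast concrete, here is a minimal sketch of the two estimates. It is written in
Python for brevity and is not Pig code: Split.length stands in for what InputSplit.getLength()
would report, the sampled sizes stand in for values returned by Tuple.getMemorySize(), and
every function name, the expansion factor, and the memory budget are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Split:
    """Hypothetical stand-in for a Hadoop InputSplit."""
    length: int  # bytes, as InputSplit.getLength() would report them


def total_split_length(splits: Sequence[Split]) -> int:
    """File size estimate: the sum of the per-split lengths."""
    return sum(s.length for s in splits)


def memory_from_file_size(splits: Sequence[Split], assumed_expansion: float) -> float:
    """Size-based estimate of in-memory bytes.

    assumed_expansion is a guess that folds in the compression ratio and
    the serialization overhead; this is the assumption the comment calls tricky.
    """
    return total_split_length(splits) * assumed_expansion


def average_tuple_memory(sampled_sizes: Sequence[int]) -> float:
    """Sample-based estimate: mean in-memory size of the sampled tuples.

    Each entry stands in for one Tuple.getMemorySize() result.
    """
    if not sampled_sizes:
        raise ValueError("need at least one sampled tuple")
    return sum(sampled_sizes) / len(sampled_sizes)


def tuples_that_fit(memory_budget_bytes: int, sampled_sizes: Sequence[int]) -> int:
    """Illustrative use: how many average-sized tuples fit in a memory budget."""
    return int(memory_budget_bytes // average_tuple_memory(sampled_sizes))
```

The size-based estimate is only as good as the guessed expansion factor, while the sample-based
one measures the in-memory representation directly and needs no such guess.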

> Poisson Sample Loader should compute the number of samples required only once
> -----------------------------------------------------------------------------
>                 Key: PIG-1143
>                 URL: https://issues.apache.org/jira/browse/PIG-1143
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Sriranjan Manjunath
>            Assignee: Sriranjan Manjunath
> The current Poisson sampler forces each of the maps to compute the sample number. This
> is redundant and causes issues when a large directory is specified in the join. The sampler
> should be changed to calculate the sample count only once, and this information should be
> shared with the remaining mappers.

