hadoop-pig-dev mailing list archives

From "Sriranjan Manjunath (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-1143) Poisson Sample Loader should compute the number of samples required only once
Date Fri, 11 Dec 2009 01:17:18 GMT

    [ https://issues.apache.org/jira/browse/PIG-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789063#action_12789063 ]

Sriranjan Manjunath commented on PIG-1143:
------------------------------------------

I am OK with using InputSplits.getLength() as long as it gives you a good estimate of
the file size. Without the population size, Poisson samplers do not work well.
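
To illustrate why the population size matters, here is a minimal sketch (not Pig's actual
implementation) of deriving a sample count once from the per-split lengths. The names
avg_tuple_bytes, sample_every and the "sample.count" key are hypothetical, and the split
lengths are assumed to be the values that getLength() would report for each split.

```python
def estimate_total_bytes(split_lengths):
    """Approximate the input size by summing the reported split lengths."""
    return sum(split_lengths)


def compute_sample_count(split_lengths, avg_tuple_bytes, sample_every):
    """Estimate how many samples to draw, taking one sample per `sample_every` tuples.

    avg_tuple_bytes and sample_every are illustrative assumptions, not Pig settings.
    """
    total_bytes = estimate_total_bytes(split_lengths)
    est_tuples = total_bytes // avg_tuple_bytes   # rough population size
    count = -(-est_tuples // sample_every)        # integer ceiling division
    return max(1, count)                          # always draw at least one sample


def plan_sampling(split_lengths, avg_tuple_bytes, sample_every):
    """Compute the count once, up front, and return it as a shared setting.

    The "sample.count" key is hypothetical; the idea is that every mapper
    reads this value instead of recomputing it.
    """
    return {"sample.count": compute_sample_count(
        split_lengths, avg_tuple_bytes, sample_every)}
```

If the split lengths are a poor estimate of the real data size, est_tuples is off by the
same factor and so is the sample count, which is the concern raised above.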

Samplers expect the data to be in BinStorage. If it is not, the first job reads it and
stores it into BinStorage. The only exception is when the join follows a load/store-only
MR job.


> Poisson Sample Loader should compute the number of samples required only once
> -----------------------------------------------------------------------------
>
>                 Key: PIG-1143
>                 URL: https://issues.apache.org/jira/browse/PIG-1143
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Sriranjan Manjunath
>            Assignee: Sriranjan Manjunath
>
> The current poisson sampler forces each of the maps to compute the sample number. This
> is redundant and causes issues when a large directory is specified in the join. The sampler
> should be changed to calculate the sample count only once and this information should be shared
> with the remaining mappers.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

