hadoop-pig-dev mailing list archives

From "Thejas M Nair (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-1062) load-store-redesign branch: change SampleLoader and subclasses to work with new LoadFunc interface
Date Sat, 31 Oct 2009 01:02:59 GMT

    [ https://issues.apache.org/jira/browse/PIG-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772197#action_12772197 ]

Thejas M Nair commented on PIG-1062:

I had overlooked the fact that the input file size is also used to calculate the
number of samples. Thanks for pointing it out.

I don't know if there are any problems with using counters directly, as long as the information
is required only after the sampling phase (the first MapReduce job), i.e. it could be used in
PartitionSkewedKey().

The logic in PoissonSampleLoader.computeSamples is as follows (a detailed explanation will be
added soon to the sampler wiki page). The goal is to sample all keys from the first input that
will need to be partitioned across multiple reducers in the join phase.
Let us assume X tuples fit into the available memory in a reducer. Let's say we want 10
samples in each set of X tuples, with 95% confidence. Using Poisson distribution formulas,
we arrive at 17 as the number of tuples to be sampled every X tuples. (I don't know
why the Poisson distribution is the appropriate choice.)
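
The figure of 17 can be reproduced with a short calculation. The sketch below is only one reading of the criterion, and the comment does not confirm it: it assumes the number of samples N in a set of X tuples is Poisson-distributed with mean lambda, and it looks for the smallest integer lambda such that P(N <= 10) is at most 5%. The class and method names are hypothetical and are not taken from the Pig source.

```java
public class PoissonSampleRate {

    // P(N <= k) for N ~ Poisson(lambda), summed term by term.
    public static double poissonCdf(int k, double lambda) {
        double term = Math.exp(-lambda); // P(N = 0)
        double sum = term;
        for (int i = 1; i <= k; i++) {
            term *= lambda / i;          // P(N = i) from P(N = i - 1)
            sum += term;
        }
        return sum;
    }

    // Smallest integer mean such that P(N <= minSamples) <= 1 - confidence.
    // Assumption: this is how "10 samples with 95% confidence" is interpreted.
    public static int minRate(int minSamples, double confidence) {
        int lambda = minSamples;
        while (poissonCdf(minSamples, lambda) > 1.0 - confidence) {
            lambda++;
        }
        return lambda;
    }

    public static void main(String[] args) {
        System.out.println(minRate(10, 0.95)); // prints 17
    }
}
```

Under this reading, a mean of 16 leaves about a 7.7% chance of getting 10 or fewer samples, while a mean of 17 brings it down to about 4.9%.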

The total number of tuples to be sampled cannot be calculated without knowing the total number
of tuples. But what we do know is that we should sample one tuple every (X/17) tuples. To calculate
X, we need the average size of a tuple in memory. Using the process memory usage is unlikely
to give a good approximation of that, because (as per my understanding) calling the garbage
collector is not guaranteed to free the memory used by all unused objects. Tuple.getMemorySize()
can be used to get an estimate of the memory used by a tuple. The average size could be
estimated and corrected as we sample more tuples.
i.e., PoissonSampleLoader.getNext() will return every (H/s)-th tuple in the input (using H and s
as defined in the previous comment).
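
A minimal sketch of how the sampling interval could be re-estimated as more tuples are sampled is shown here. It is not the actual PoissonSampleLoader code: the class name, its fields, and the memory budget passed to the constructor are hypothetical, and the tuple size is taken as a plain long (for example, a value returned by Tuple.getMemorySize()).

```java
public class SampleIntervalEstimator {
    private final long availableMemoryBytes; // assumed reducer memory budget
    private final int samplesPerMemoryLoad;  // 17 in the discussion above
    private long totalSampledBytes = 0;
    private long sampledTuples = 0;

    public SampleIntervalEstimator(long availableMemoryBytes, int samplesPerMemoryLoad) {
        this.availableMemoryBytes = availableMemoryBytes;
        this.samplesPerMemoryLoad = samplesPerMemoryLoad;
    }

    // Record the in-memory size of a tuple that was just sampled.
    public void recordSample(long tupleMemoryBytes) {
        totalSampledBytes += tupleMemoryBytes;
        sampledTuples++;
    }

    // Sample one tuple every X / samplesPerMemoryLoad tuples, where
    // X = availableMemoryBytes / (running average tuple size).
    public long currentInterval() {
        if (sampledTuples == 0 || totalSampledBytes == 0) {
            return 1; // no size estimate yet, so sample the next tuple
        }
        double avgTupleBytes = (double) totalSampledBytes / sampledTuples;
        double tuplesPerMemoryLoad = availableMemoryBytes / avgTupleBytes; // X
        return Math.max(1L, (long) (tuplesPerMemoryLoad / samplesPerMemoryLoad));
    }
}
```

With a budget of 1,700,000 bytes and a first sampled tuple of 100 bytes, X is 17,000 and the interval is 1,000 tuples; if the next sampled tuple is 300 bytes, the average rises to 200 bytes and the interval drops to 500.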

In PartitionSkewedKey.exec(), Dmitriy's idea of using the number of samples and the sample rate
(H/s) can be used to estimate the total number of tuples.

WeightedRangePartitioner.setConf is another function that uses fileSize(). That needs to change
as well. I haven't looked at that yet.

> load-store-redesign branch: change SampleLoader and subclasses to work with new LoadFunc
> ---------------------------------------------------------------------------------------------------
>                 Key: PIG-1062
>                 URL: https://issues.apache.org/jira/browse/PIG-1062
>             Project: Pig
>          Issue Type: Sub-task
>            Reporter: Thejas M Nair
>            Assignee: Thejas M Nair
> This is part of the effort to implement new load store interfaces as laid out in http://wiki.apache.org/pig/LoadStoreRedesignProposal
> PigStorage and BinStorage are now working.
> SampleLoader and subclasses - RandomSampleLoader, PoissonSampleLoader - need to be changed to work with new LoadFunc interface.
> Fixing SampleLoader and RandomSampleLoader will get order-by queries working.
> PoissonSampleLoader is used by skew join.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
