[ https://issues.apache.org/jira/browse/SPARK21057?page=com.atlassian.jira.plugin.system.issuetabpanels:alltabpanel
]
Lovasoa updated SPARK21057:

Description:
I was reading the source of Spark, and found this:
https://github.com/apache/spark/blob/v2.1.1/core/src/main/scala/org/apache/spark/partial/CountEvaluator.scala#L50L72
This is the function that estimates the probability distribution of the total count of elements
in an RDD given the count of only some partitions.
This function does a strange thing: when the number of elements counted so far is less than
10 000, it models the total count with a negative binomial (Pascal) law, else, it models it
with a Poisson law.
Modeling our number of uncounted elements with a negative binomial law is like saying that
we ran over elements, counting only some, and stopping after having counted a given number
of elements.
But this does not model what really happened. Our counting was limited in time, not in number
of counted elements, and we can count only some of the elements in a partition.
I propose to use the Poisson distribution in every case, as it can be justified under the
hypothesis that the number of elements in each partition is independent and follows a Poisson
law.
was:
I was reading the source of Spark, and found this:
https://github.com/apache/spark/blob/v2.1.1/core/src/main/scala/org/apache/spark/partial/CountEvaluator.scala#L50L72
This is the function that estimates the probability distribution of the total count of elements
in an RDD given the count of only some partitions.
This function does a strange thing: when the number of elements counted so far is less than
10 000, it models the total count with a negative binomial (Pascal) law, else, it models it
with a Poisson law.
Modeling our number of uncounted elements with a negative binomial law is like saying that
we ran over elements, counting only some, and stopping after having counted a given number
of elements.
But this does not model what really happened. Our counting was limited in time, not in number
of counted elements.
I propose to use the Poisson distribution in every case, as it can be justified under the
hypothesis that the number of elements in each partition is independent and follows a Poisson
law.
> Do not use a PascalDistribution in countApprox
> 
>
> Key: SPARK21057
> URL: https://issues.apache.org/jira/browse/SPARK21057
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.1.1
> Reporter: Lovasoa
>
> I was reading the source of Spark, and found this:
> https://github.com/apache/spark/blob/v2.1.1/core/src/main/scala/org/apache/spark/partial/CountEvaluator.scala#L50L72
> This is the function that estimates the probability distribution of the total count of
elements in an RDD given the count of only some partitions.
> This function does a strange thing: when the number of elements counted so far is less
than 10 000, it models the total count with a negative binomial (Pascal) law, else, it models
it with a Poisson law.
> Modeling our number of uncounted elements with a negative binomial law is like saying
that we ran over elements, counting only some, and stopping after having counted a given number
of elements.
> But this does not model what really happened. Our counting was limited in time, not
in number of counted elements, and we can count only some of the elements in a partition.
> I propose to use the Poisson distribution in every case, as it can be justified under
the hypothesis that the number of elements in each partition is independent and follows a
Poisson law.

This message was sent by Atlassian JIRA
(v6.3.15#6346)

To unsubscribe, email: issuesunsubscribe@spark.apache.org
For additional commands, email: issueshelp@spark.apache.org
