[ https://issues.apache.org/jira/browse/CASSANDRA-8883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Carl Yeksigian updated CASSANDRA-8883:

Attachment: 88832.1.txt
Since numpy has access to the original values, it provides interpolation between the points
if the percentile isn't exactly on a boundary:
{code}
np.percentile(np.array([1, 2, 3, 4, 5, 6]), 50)
==> 3.5
{code}
Since we are using the histogram, we don't know exactly where the percentile lands within
a bucket, so we just need to return a value inside of that range. Currently we are returning
the end of the range before the one where the percentile occurs.
I've changed EstimatedHistogram to use ceil instead of floor, and updated the tests accordingly.
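To illustrate the difference, here is a minimal Python sketch of a bucketed percentile lookup. It is a simplified illustration and not the actual EstimatedHistogram code: the function name histogram_percentile, its parameters, and the pluggable rounding argument are hypothetical, and it leaves out anything else the real class does. It uses the data and offsets from the issue description.

```python
import math


def histogram_percentile(counts, offsets, percentile, rounding=math.ceil):
    """Return the offset of the bucket containing the requested percentile.

    Hypothetical simplification: counts[i] is the number of values recorded
    in the bucket whose upper bound is offsets[i].
    """
    if not 0.0 <= percentile <= 1.0:
        raise ValueError("percentile must be in [0, 1]")

    # Rank of the element we are looking for among all recorded values.
    pcount = rounding(sum(counts) * percentile)
    if pcount == 0:
        return 0

    # Walk the buckets until the running total reaches that rank.
    elements = 0
    for count, offset in zip(counts, offsets):
        elements += count
        if elements >= pcount:
            return offset
    return 0


counts = [1, 1, 1, 1, 1]   # data from the issue description
offsets = [1, 2, 3, 4, 5]  # offsets from the issue description

floor_result = histogram_percentile(counts, offsets, 0.5, rounding=math.floor)
ceil_result = histogram_percentile(counts, offsets, 0.5, rounding=math.ceil)
print(floor_result, ceil_result)  # prints "2 3"
```

With floor the target rank is 2, so the lookup stops at the second bucket; with ceil the rank is 3, so it stops at the third bucket, which matches the value numpy reports for the same five points.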
> Percentile computation should use ceil not floor in EstimatedHistogram
> 
>
> Key: CASSANDRA-8883
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8883
> Project: Cassandra
> Issue Type: Bug
> Components: Core
> Reporter: Chris Lohfink
> Assignee: Carl Yeksigian
> Priority: Minor
> Fix For: 2.1.4
>
> Attachments: 88832.1.txt
>
>
> When computing the pcount, Cassandra uses floor, and the comparison with elements is >=.
> Consider a simple example with a total of five elements:
> {code}
> // data
> [1, 1, 1, 1, 1]
> // offsets
> [1, 2, 3, 4, 5]
> {code}
> Cassandra would report the 50th percentile as 2, while 3 is the more expected value.
> As a comparison, using numpy:
> {code}
> import numpy as np
> np.percentile(np.array([1, 2, 3, 4, 5]), 50)
> ==> 3.0
> {code}
> Percentiles were added in CASSANDRA-4022 but are now used a lot in the metrics Cassandra
> reports. I think it should err on the side of overestimating instead of underestimating.

