cassandra-commits mailing list archives

From "Carl Yeksigian (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (CASSANDRA-8883) Percentile computation should use ceil not floor in EstimatedHistogram
Date Tue, 03 Mar 2015 19:02:05 GMT

     [ https://issues.apache.org/jira/browse/CASSANDRA-8883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Carl Yeksigian updated CASSANDRA-8883:
--------------------------------------
    Attachment: 8883-2.1.txt

Since numpy has access to the original values, it can interpolate between data points
when the percentile doesn't fall exactly on one of them:
{code}
np.percentile(np.array([1, 2, 3, 4, 5, 6]), 50)
==> 3.5
{code}
Since we only have the histogram, we don't know where within a bucket the percentile lands,
so we just need to return a value inside that bucket's range. Currently we return the end of
the range before the one where the percentile occurs.
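
To make the interpolation point concrete, here is a minimal pure-Python sketch of linear interpolation between the two nearest ranks. The function name interpolated_percentile is hypothetical and is not part of numpy or Cassandra; it only reproduces the numpy results quoted in this ticket.

```python
import math

def interpolated_percentile(values, q):
    """Percentile with linear interpolation between the two nearest ranks."""
    data = sorted(values)
    # Fractional rank of the requested percentile in the sorted data.
    rank = (len(data) - 1) * q / 100.0
    lower = math.floor(rank)
    upper = math.ceil(rank)
    # Weight the two neighbours by how far the rank lies between them.
    fraction = rank - lower
    return data[lower] + (data[upper] - data[lower]) * fraction

print(interpolated_percentile([1, 2, 3, 4, 5, 6], 50))  # 3.5
print(interpolated_percentile([1, 2, 3, 4, 5], 50))     # 3.0
```

This only works because the raw values are available; a histogram keeps per-bucket counts, so there is nothing to interpolate between.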

I've changed EstimatedHistogram to use ceil instead of floor, and updated the tests accordingly.
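
As a rough illustration of the difference, the sketch below walks the five-element example from the issue description with both rounding modes. It is a simplified Python model, not the Java code in the attached patch: histogram_percentile and its parameters are hypothetical names, and returning 0 for an empty target is an assumption made for the sketch.

```python
import math

def histogram_percentile(counts, offsets, percentile, rounding=math.ceil):
    """Return the bucket offset at which the cumulative count reaches the target."""
    total = sum(counts)
    # Number of elements that must be covered; floor vs ceil is the point at issue.
    pcount = rounding(total * percentile)
    if pcount == 0:
        return 0  # assumption for this sketch: nothing to cover
    elements = 0
    for count, offset in zip(counts, offsets):
        elements += count
        # Same >= comparison as described in the issue.
        if elements >= pcount:
            return offset
    return 0

counts = [1, 1, 1, 1, 1]
offsets = [1, 2, 3, 4, 5]
print(histogram_percentile(counts, offsets, 0.5, rounding=math.floor))  # 2
print(histogram_percentile(counts, offsets, 0.5, rounding=math.ceil))   # 3
```

With floor, the target count is 2 and the second bucket already satisfies it; with ceil, the target is 3 and the result moves to the third bucket, matching the expected value.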

> Percentile computation should use ceil not floor in EstimatedHistogram
> ----------------------------------------------------------------------
>
>                 Key: CASSANDRA-8883
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8883
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Chris Lohfink
>            Assignee: Carl Yeksigian
>            Priority: Minor
>             Fix For: 2.1.4
>
>         Attachments: 8883-2.1.txt
>
>
> When computing the pcount, Cassandra uses floor, and the comparison with elements is >=.
> Consider a simple example with a total of five elements:
> {code}
> // data
> [1, 1, 1, 1, 1]
> // offsets
> [1, 2, 3, 4, 5]
> {code}
> Cassandra would report the 50th percentile as 2, while 3 is the more expected value.
> As a comparison, using numpy:
> {code}
> import numpy as np
> np.percentile(np.array([1, 2, 3, 4, 5]), 50)
> ==> 3.0
> {code}
> Percentiles were added in CASSANDRA-4022 but are now used a lot in the metrics Cassandra
> reports. I think it should err on the side of overestimating instead of underestimating.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
