hbase-issues mailing list archives

From "Rafi Shachar (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HBASE-14324) MetricSampleQuantiles maintains quantiles summary in a wrong way
Date Thu, 27 Aug 2015 10:16:45 GMT

     [ https://issues.apache.org/jira/browse/HBASE-14324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rafi Shachar updated HBASE-14324:
---------------------------------
    Description: 
The MetricSampleQuantiles class computes quantile estimates for a data stream according to
the 2005 paper by Cormode, Korn, Muthukrishnan, and Srivastava (CKMS), "Effective Computation
of Biased Quantiles over Data Streams". However, the implementation deviates from the paper.
In particular, the insert and compact methods use the rank of an item within the summary,
whereas they should use the item's estimated rank in the stream. Due to this bug the resulting
summary is much larger than required; according to my experiments it is more than 10 times
larger. Furthermore, the summary keeps growing with the stream size even when the distribution
doesn't change. When the number of items is in the tens of millions, the summary holds about
200K entries, while it should be in the range of a few thousand. This has a significant effect
on performance; I didn't see a significant effect on accuracy.
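
To make the rank distinction concrete, below is a minimal illustrative sketch (not the HBase
code; the Item type and field names are assumptions) showing that in a CKMS summary the
estimated stream rank r_i is the running sum of the stored g values, not the item's index
within the summary:

{code:java}
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a minimal CKMS-style summary entry holding value,
// g (the rank gap to the previous stored entry) and delta (the rank
// uncertainty of this entry).
final class RankDemo {
  static final class Item {
    final long value;
    final long g;
    final long delta;
    Item(long value, long g, long delta) {
      this.value = value;
      this.g = g;
      this.delta = delta;
    }
  }

  public static void main(String[] args) {
    List<Item> summary = new ArrayList<>();
    summary.add(new Item(3, 1, 0));
    summary.add(new Item(7, 4, 2));
    summary.add(new Item(9, 2, 1));

    long prevRank = 0; // r_i in the paper: estimated stream rank of item i-1
    for (int i = 0; i < summary.size(); i++) {
      // The bug described above: the code uses i (the index in the summary)
      // where the paper's invariant is stated in terms of prevRank (r_i).
      System.out.printf("summary index=%d, stream rank r_i=%d%n", i, prevRank);
      prevRank += summary.get(i).g;
    }
  }
}
{code}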

The insert-batch and compress methods call allowableError(), passing the rank of the item
within the summary. They should instead pass the item's estimated rank in the stream. In the
CKMS paper this rank is denoted r_i, which more precisely is the estimated rank of item i-1.
allowableError currently considers the size of the summary, whereas it should consider the
number of items that have been observed.
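
A sketch of the corrected computation, under the paper's targeted-quantiles setting; the
class and member names here are illustrative, not the actual HBase signatures. The two fixes
suggested above are that the argument is r_i (the running sum of g values) and that n is the
count of observed items rather than the summary size:

{code:java}
// Illustrative names; count (n) tracks items observed in the stream,
// deliberately distinct from the summary's size.
final class AllowableErrorSketch {
  // A targeted quantile: phi (the quantile) and epsilon (allowed error at phi).
  static final class Quantile {
    final double quantile;
    final double error;
    Quantile(double quantile, double error) {
      this.quantile = quantile;
      this.error = error;
    }
  }

  private final Quantile[] quantiles;
  private final long count; // n: number of observed items, NOT the summary size

  AllowableErrorSketch(Quantile[] quantiles, long count) {
    this.quantiles = quantiles;
    this.count = count;
  }

  // f(r_i, n) from the CKMS paper. The argument must be the estimated stream
  // rank r_i (the running sum of g values), not the index in the summary.
  double allowableError(long rank) {
    double minError = count + 1;
    for (Quantile q : quantiles) {
      double error;
      if (rank <= q.quantile * count) {
        error = (2.0 * q.error * (count - rank)) / (1.0 - q.quantile);
      } else {
        error = (2.0 * q.error * rank) / q.quantile;
      }
      minError = Math.min(minError, error);
    }
    return minError;
  }

  public static void main(String[] args) {
    AllowableErrorSketch s = new AllowableErrorSketch(new Quantile[] {
        new Quantile(0.50, 0.050), new Quantile(0.99, 0.001) }, 10_000_000L);
    System.out.println(s.allowableError(5_000_000L)); // large slack mid-stream
  }
}
{code}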

In addition, the class currently uses a statically sized buffer of 500 entries, which yields
poor runtime performance. Increasing the buffer to 10000 entries or more yields much better
performance (I'm using a buffer size of 100K). The buffer size could be made dynamic, growing
with the number of items in the stream.
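
As an illustration only, a dynamic sizing rule might look like the sketch below; the growth
ratio and the cap are assumptions drawn from the numbers above, not a tested policy:

{code:java}
// Hypothetical sizing rule: keep the current 500 as a floor, grow with the
// number of observed items, and cap at the 100K that worked well above.
final class BufferSizing {
  static int bufferSize(long itemsSeen) {
    long grown = Math.max(500L, itemsSeen / 100); // illustrative growth ratio
    return (int) Math.min(grown, 100_000L);
  }

  public static void main(String[] args) {
    System.out.println(bufferSize(1_000));      // 500: the floor
    System.out.println(bufferSize(10_000_000)); // 100000: the cap
  }
}
{code}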

> MetricSampleQuantiles maintains quantiles summary in a wrong way
> ----------------------------------------------------------------
>
>                 Key: HBASE-14324
>                 URL: https://issues.apache.org/jira/browse/HBASE-14324
>             Project: HBase
>          Issue Type: Bug
>          Components: metrics
>            Reporter: Rafi Shachar
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
