Return-Path: Delivered-To: apmail-commons-dev-archive@www.apache.org Received: (qmail 90497 invoked from network); 18 Sep 2010 14:17:04 -0000 Received: from unknown (HELO mail.apache.org) (140.211.11.3) by 140.211.11.9 with SMTP; 18 Sep 2010 14:17:04 -0000 Received: (qmail 2281 invoked by uid 500); 18 Sep 2010 14:17:04 -0000 Delivered-To: apmail-commons-dev-archive@commons.apache.org Received: (qmail 1707 invoked by uid 500); 18 Sep 2010 14:17:01 -0000 Mailing-List: contact dev-help@commons.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: "Commons Developers List" Delivered-To: mailing list dev@commons.apache.org Received: (qmail 1692 invoked by uid 99); 18 Sep 2010 14:17:00 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Sat, 18 Sep 2010 14:17:00 +0000 X-ASF-Spam-Status: No, hits=0.0 required=10.0 tests=FREEMAIL_FROM,RCVD_IN_DNSWL_NONE,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (athena.apache.org: domain of phil.steitz@gmail.com designates 209.85.216.43 as permitted sender) Received: from [209.85.216.43] (HELO mail-qw0-f43.google.com) (209.85.216.43) by apache.org (qpsmtpd/0.29) with ESMTP; Sat, 18 Sep 2010 14:16:54 +0000 Received: by qwj8 with SMTP id 8so3119802qwj.30 for ; Sat, 18 Sep 2010 07:16:33 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:received:received:message-id:date:from :user-agent:mime-version:to:subject:references:in-reply-to :content-type:content-transfer-encoding; bh=/iS0BG+74+XLH6FaHKmy/c2rp2tCnwU7tin71kQfpQA=; b=fwI7Q0ULubEiUTVolYqvCDPiu36k5lLwI9ilfhBlkwvOqr9mIJxirmosxEcpesznv8 cD5PfKLZSLNWrkrmw+5TFambeOc/KWCGKnDhV9dLsAjD7q2AEjFwHDgQrJEYBJVAmWv/ N5HpdBAycuX3q/DJspCKe0HaFRoayHc99qdvs= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=message-id:date:from:user-agent:mime-version:to:subject:references :in-reply-to:content-type:content-transfer-encoding; b=pO9UYLrm/1fru+H+2b28vKJgUufOsAKiIrc1tndOfeZtuZj3bfLfduOfnS00uLIsag K8P3tklHj4ikvr/NT6JYM5PdZ8b8WoPsbn21eX4N71io2yqn5RgzVBWsbcEdDdotnUpN Sd4nrfbd83HB4gkx/lz6ta6GANnYFkcsa/+cU= Received: by 10.229.88.15 with SMTP id y15mr4702285qcl.39.1284819393089; Sat, 18 Sep 2010 07:16:33 -0700 (PDT) Received: from phil-steitzs-macbook-pro.local (c-68-44-120-111.hsd1.de.comcast.net [68.44.120.111]) by mx.google.com with ESMTPS id 9sm5469297qca.6.2010.09.18.07.16.31 (version=SSLv3 cipher=RC4-MD5); Sat, 18 Sep 2010 07:16:31 -0700 (PDT) Message-ID: <4C94C9BE.6050605@gmail.com> Date: Sat, 18 Sep 2010 10:16:30 -0400 From: Phil Steitz User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.2.9) Gecko/20100915 Thunderbird/3.1.4 MIME-Version: 1.0 To: Commons Developers List Subject: Re: [math] speeding up percentile based statistics References: <4C93A6BB.607@free.fr> <4C93BF4B.7070308@free.fr> In-Reply-To: <4C93BF4B.7070308@free.fr> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit On 9/17/10 3:19 PM, Luc Maisonobe wrote: > Le 17/09/2010 19:55, Ted Dunning a écrit : >> There are also on-line percentile estimation methods that require only a >> single pass over the data in order to get good estimates of several quantile >> at the same time. >> >> If you are interested in getting an estimate of some quantile of the >> underlying distribution that generated your data then these on-line methods >> will give you an estimate that is very nearly as accurate as sorting your >> sample. Sorting gives you the exact quantiles of your sample, but only an >> estimate of the quantiles of your underlying distribution. >> >> There is a simplified implementation of this in Mahout along with test cases >> that demonstrate reasonable accuracy. >> >> These techniques are well described in the article: *Incremental Quantile >> Estimation for Massive >> Tracking >> * >> * >> * >> *(available at >> http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.105.1580 )* > > Thanks for the pointer, it looks interesting. > >> * >> * >> *Regarding the incremental selection method to find multiple quantiles, I >> think that you save a little bit if you are looking for a few quantiles, but >> the added complexity and special cases are going to make testing difficult. >> Wouldn't it be better to use one of the following strategies:* >> * >> * >> *- keep a copy of the sorted data and if that copy is available, just use >> it. This cuts the cost of 100 quantiles probed incrementally to a single >> sort. > > This is exactly what this user did: put the sort out of the loop. > >> Moreover, since sort is n log n without a radix sort, then the >> increment selection algorithm can only win if 100< log n which is pretty >> unlikely.* > > I don't understand. Partition-based selection algorithms are basically > partition-based sort algorithm where we recurse in only one of the > partition once the pivot has been chosen. Subsequent calls therefore > don't restart from an array of size n but from smaller sub-arrays, has > the pivot can be saved (at least the top level ones). If at the end the > number of selections is so high the array ends up to be completely > sorted, the total cost is probably not much higher than what an initial > sort would have done. It will be higher since their are some bookkeeping > to do, but not so much higher I think. Doing one call corresponds to > resuming the partial sort from the state resulting from previous calls. > > Did I miss something ? > > >> * >> * >> *- add a method to probe multiple quantiles at the same time. This >> potentially uses less memory than the first approach, but is dependent on >> the user calling the right method.* > > Since there are different use case, having several methods seems fair. A > multiple quantiles method is a good idea. > >> * >> * >> *- use the on-line algorithm mentioned above with two passes, one to >> estimate the quantiles of interest, a second to refine the estimate using >> the actual data. This allows the multiple probe method to be linear in time >> and should give you exact sample quantiles. It doesn't help the repeated >> probe problem.* > > I'm not sure I understand well. However, the on-line method by itself > would be an interesting addition. It would allow quantiles computation > with the Storeless versions of our classes. +1 - definitely valuable for the storeless case. I need to think about this some more / see some patches; but my initial reaction is that we can get a big bang from just doing what the user did (caching the sorted array and inserting into it when addValue is called) for the data-in-memory impls. I would suggest implementing that for for Percentile itself (taking care to handle addValue and rolling windows correctly) and DescriptiveStatistics; but add a new (storelesss) statistic based on an incremental quantile estimation algorithm and make this accessible via the storeless aggregate, SummaryStatistics. It might be interesting to include this optionally with DescriptiveStatistics as well, so that in the rolling window case you could compare the current window distribution to the full sample. Phil > > Luc > >> * >> * >> On Fri, Sep 17, 2010 at 10:34 AM, Luc Maisonobewrote: >> >>> Hi all, >>> >>> During a recent face to face discussion with some commons-math users >>> from a large project, it appeared one implementation choice for >>> Percentile statistics was a huge performance bottleneck. >>> >>> When the Percentile.evaluate method is called, the sample array is >>> copied and sorted. In one use case, 100 calls were made for each sample, >>> so the same array was sorted 100 times. In another use case, only one >>> call is made (typically to compute the median) but the user wondered why >>> the array should be completely sorted. In fact, using a selection >>> algorithm instead of a sorting algorithm would be sufficient. >>> >>> I would like to have you opinion about providing a different evaluate >>> method that would process an array provided previously and use only >>> selections to provide the percentile. Consider for example the following >>> input array: >>> >>> [ -6, 8, 4, -5, 0, 2, 7 ] >>> >>> If we first as for the median, a selection algorithm may reorganize the >>> array as: >>> >>> [ -6, 0, -5, 2, 8, 4, 7 ] >>> >>> were the left part is smaller than the central element, the right part >>> is larger and hence the central element 2 is known to be the median. >>> >>> Then we ask for another value, the 25% percentile. Since the object >>> already knows the smaller elements are in the left part, it can use a >>> select algorithm on the 3 elements left part and extract the value -5 >>> without even trying to sort the upper part of the array. >>> >>> What do you think ? >>> >>> Luc >>> >>> --------------------------------------------------------------------- >>> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org >>> For additional commands, e-mail: dev-help@commons.apache.org >>> >>> >> > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org > For additional commands, e-mail: dev-help@commons.apache.org > --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org For additional commands, e-mail: dev-help@commons.apache.org