Return-Path: X-Original-To: apmail-storm-user-archive@minotaur.apache.org Delivered-To: apmail-storm-user-archive@minotaur.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 3F985109C9 for ; Wed, 22 Jan 2014 14:16:01 +0000 (UTC) Received: (qmail 75691 invoked by uid 500); 22 Jan 2014 14:16:00 -0000 Delivered-To: apmail-storm-user-archive@storm.apache.org Received: (qmail 75666 invoked by uid 500); 22 Jan 2014 14:16:00 -0000 Mailing-List: contact user-help@storm.incubator.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@storm.incubator.apache.org Delivered-To: mailing list user@storm.incubator.apache.org Received: (qmail 75658 invoked by uid 99); 22 Jan 2014 14:16:00 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 22 Jan 2014 14:16:00 +0000 X-ASF-Spam-Status: No, hits=1.5 required=5.0 tests=HTML_MESSAGE,RCVD_IN_DNSWL_LOW,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (athena.apache.org: domain of ncleung@gmail.com designates 209.85.213.177 as permitted sender) Received: from [209.85.213.177] (HELO mail-ig0-f177.google.com) (209.85.213.177) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 22 Jan 2014 14:15:54 +0000 Received: by mail-ig0-f177.google.com with SMTP id k19so1766798igc.4 for ; Wed, 22 Jan 2014 06:15:34 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :content-type; bh=ycGnT5ChvLP8kMs3TEP4iDRdmX194d1aAOYi2c0mtMU=; b=FcZApSxiQPFWNzGCjLSUeTWHjctZfWjtnvApHN38tAXlbOejNn/RVyOjC3CZmjgLd6 mPA0KO7EMCmWMo17qCBHAAdmpqkJw90asXDCLdlmCBkYZRDswFVmSAteu8Cs8Ae6quM1 tx30+SmQoFLArLSskwvY1DFIexl/+LkTsvja+CUXEMbnQOofJ0oFkRD0coTJgO/w95hO MM2EqwgxXnmLn4zzCCiiAkelyU5qHsqXrEj9lO2ZDvqpch2FahTe93gOU8RKt3XVPQP+ cP9OHMRCKDcT6/P9w5k6BAprGB5x7Bu+WTuRQMF/9ML1KFyOxB365u/lNPIXO0D/zzzT hwhA== MIME-Version: 1.0 X-Received: by 10.50.42.134 with SMTP id o6mr3249454igl.18.1390400134130; Wed, 22 Jan 2014 06:15:34 -0800 (PST) Received: by 10.64.227.197 with HTTP; Wed, 22 Jan 2014 06:15:34 -0800 (PST) Received: by 10.64.227.197 with HTTP; Wed, 22 Jan 2014 06:15:34 -0800 (PST) In-Reply-To: References: <00098BF6-10D1-48CE-837F-9FFE92FB1E61@gmail.com> <1390311119.173262209@f7.my.com> Date: Wed, 22 Jan 2014 09:15:34 -0500 Message-ID: Subject: Re: Re[2]: Compute the top 100 million in the total 10 billion data efficiently. From: Nathan Leung To: user Content-Type: multipart/alternative; boundary=14dae93406ad17323d04f08fc29c X-Virus-Checked: Checked by ClamAV on apache.org --14dae93406ad17323d04f08fc29c Content-Type: text/plain; charset=ISO-8859-1 You don't need the full set in ram. You only need to keep the largest 100m in ram, but you would need to keep it in a sorted data structure. Our ram is tight you can keep keys only then extract the data in a second pass. On Jan 22, 2014 4:36 AM, "Ted Dunning" wrote: > > On Tue, Jan 21, 2014 at 7:31 AM, wrote: > >> You mentioned a approximate algorithm. That's great! I will check it out >> later. But, Is there a way to calculate it in a precise way? > > > If you want to select the 1% largest numbers, then you have a few choices. > > If you have memory for the full set, you can sort. > > If you have room to keep 1% of the samples in memory, you need to do 100 > passes. > > If you are willing to accept small errors, then you can do it in a single > pass. > > These trade-offs are not optional, but are theorems. > > > --14dae93406ad17323d04f08fc29c Content-Type: text/html; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable

You don't need the full set in ram. You only need to kee= p the largest 100m in ram, but you would need to keep it in a sorted data s= tructure. Our ram is tight you can keep keys only then extract the data in = a second pass.

On Jan 22, 2014 4:36 AM, "Ted Dunning"= <ted.dunning@gmail.com>= wrote:

= On Tue, Jan 21, 2014 at 7:31 AM, <churylin@gmail.com> wrot= e:
You mentioned a approximate algorithm. That's great! I will check it o= ut later. But, Is there a way to calculate it in a precise way?

If you want to select the 1% largest numbers, then you have a fe= w choices.

If you have= memory for the full set, you can sort.
If you have room to keep 1% of the sample= s in memory, you need to do 100 passes.

If you are = willing to accept small errors, then you can do it in a single pass.
<= div class=3D"gmail_extra">
These trade-= offs are not optional, but are theorems.


<= /div>
--14dae93406ad17323d04f08fc29c--