Mailing-List: contact user-help@storm.incubator.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@storm.incubator.apache.org
Received-SPF: pass (athena.apache.org: domain of ncleung@gmail.com designates
 209.85.213.177 as permitted sender)
MIME-Version: 1.0
In-Reply-To: 
 <CAJwFCa0wgj4bt+bBO4TEryPkDM1nEugdvuCy_yseEq9-i-1ZRQ@mail.gmail.com>
References: 
 <CAEN10Jo-iBsG4CRUgT3A6ZzU2UgucKd=FYNZmRqUhSSckQHdJQ@mail.gmail.com>
	<00098BF6-10D1-48CE-837F-9FFE92FB1E61@gmail.com>
	<1390311119.173262209@f7.my.com>
	<CAJwFCa0wgj4bt+bBO4TEryPkDM1nEugdvuCy_yseEq9-i-1ZRQ@mail.gmail.com>
Date: Wed, 22 Jan 2014 09:15:34 -0500
Message-ID: 
 <CAMHaYAe9GVABOq74Cqz-TgReSt=-6vA9mxRxDtUBsnJYzaGVyg@mail.gmail.com>
Subject: Re: Re[2]: Compute the top 100 million in the total 10 billion data
 efficiently.
From: Nathan Leung <ncleung@gmail.com>
To: user <user@storm.incubator.apache.org>
Content-Type: multipart/alternative; boundary=14dae93406ad17323d04f08fc29c

--14dae93406ad17323d04f08fc29c
Content-Type: text/plain; charset=ISO-8859-1

You don't need the full set in RAM. You only need to keep the largest 100m
in RAM, but you would need to keep it in a sorted data structure. If RAM
is tight, you can keep only the keys, then extract the data in a second pass.
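
A rough Python sketch of the idea, in case it helps. It assumes each record
is a (key, payload) tuple with a comparable key, that the keys are distinct,
and that the input can be streamed twice. The names top_k_keys and
extract_records are made up for illustration, and a min-heap stands in for
the "sorted data structure":

```python
import heapq
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[int, str]  # (key, payload); this layout is an assumption


def top_k_keys(keys: Iterable[int], k: int) -> List[int]:
    """Pass 1: return the k largest keys, holding at most k in memory."""
    if k <= 0:
        return []
    heap: List[int] = []  # min-heap: heap[0] is the smallest key kept so far
    for key in keys:
        if len(heap) < k:
            heapq.heappush(heap, key)
        elif key > heap[0]:
            # Drop the current smallest and keep the new, larger key.
            heapq.heapreplace(heap, key)
    return sorted(heap, reverse=True)


def extract_records(records: Iterable[Record],
                    top_keys: List[int]) -> Iterator[Record]:
    """Pass 2: stream the data again and yield the records that made the cut."""
    if not top_keys:
        return
    threshold = min(top_keys)  # smallest key that is still in the top k
    for record in records:
        if record[0] >= threshold:
            yield record


if __name__ == "__main__":
    data = [(5, "e"), (1, "a"), (9, "i"), (3, "c"), (7, "g")]
    keys = top_k_keys((key for key, _ in data), k=2)
    print(keys)
    print(list(extract_records(data, keys)))
```

Each input item costs at most O(log k) heap work, and memory stays at k keys.
If keys can repeat, the threshold test in the second pass may return more
than k records, so ties would need extra handling.
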
On Jan 22, 2014 4:36 AM, "Ted Dunning" <ted.dunning@gmail.com> wrote:

>
> On Tue, Jan 21, 2014 at 7:31 AM, <churylin@gmail.com> wrote:
>
>> You mentioned an approximate algorithm. That's great! I will check it out
>> later. But is there a way to calculate it in a precise way?
>
>
> If you want to select the 1% largest numbers, then you have a few choices.
>
> If you have memory for the full set, you can sort.
>
> If you have room to keep 1% of the samples in memory, you need to do 100
> passes.
>
> If you are willing to accept small errors, then you can do it in a single
> pass.
>
> These trade-offs are not optional, but are theorems.
>
>
>

--14dae93406ad17323d04f08fc29c--