storm-user mailing list archives

From Nathan Leung <ncle...@gmail.com>
Subject Re: Re[2]: Compute the top 100 million in the total 10 billion data efficiently.
Date Wed, 22 Jan 2014 14:15:34 GMT
You don't need the full set in RAM. You only need to keep the largest 100M
in RAM, but you would need to keep them in an ordered data structure. If RAM
is tight, you can keep keys only, then extract the full data in a second pass.
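For illustration, a minimal sketch of that bounded-memory approach in Java,
using a min-heap as the ordered structure (the usual choice; the class name
and the small k here are made up for the example, not from the thread):

    import java.util.PriorityQueue;

    // Keep the k largest values seen so far in a bounded min-heap.
    // k would be 100M for the original question; tiny here for the demo.
    public class TopK {
        public static PriorityQueue<Long> topK(long[] values, int k) {
            // The smallest of the current top-k sits at the root, so each
            // new value needs only one comparison against heap.peek().
            PriorityQueue<Long> heap = new PriorityQueue<>(k);
            for (long v : values) {
                if (heap.size() < k) {
                    heap.offer(v);
                } else if (v > heap.peek()) {
                    heap.poll();   // evict the smallest of the current top-k
                    heap.offer(v);
                }
            }
            return heap;           // exactly the k largest values, O(k) memory
        }

        public static void main(String[] args) {
            long[] data = {5, 1, 9, 3, 7, 8, 2, 6, 4, 0};
            System.out.println(topK(data, 3)); // [7, 9, 8]: the 3 largest, heap order
        }
    }

Keys only is what makes this fit: 100M longs is roughly 800 MB as primitives,
though a boxed PriorityQueue<Long> costs several times that, so a primitive
long-array heap would be the realistic choice at that scale.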
On Jan 22, 2014 4:36 AM, "Ted Dunning" <ted.dunning@gmail.com> wrote:

>
> On Tue, Jan 21, 2014 at 7:31 AM, <churylin@gmail.com> wrote:
>
>> You mentioned an approximate algorithm. That's great! I will check it out
>> later. But is there a way to calculate it in a precise way?
>
>
> If you want to select the 1% largest numbers, then you have a few choices.
>
> If you have memory for the full set, you can sort.
>
> If you have room to keep 1% of the samples in memory, you need to do 100
> passes.
>
> If you are willing to accept small errors, then you can do it in a single
> pass.
>
> These trade-offs are not optional; they are theorems.
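On the single-pass, small-errors option: one concrete candidate is a quantile
sketch such as Ted's t-digest library (whether that is the approximate
algorithm referred to earlier in the thread is my assumption, as is the
com.tdunning:t-digest dependency below):

    import com.tdunning.math.stats.TDigest;
    import java.util.Random;

    // One pass to build a t-digest of the value distribution, then read
    // off the 99th percentile as an approximate cutoff for the top 1%.
    public class ApproxTopPercent {
        public static void main(String[] args) {
            TDigest digest = TDigest.createMergingDigest(100); // compression
            Random rng = new Random(42);

            // Single pass, constant memory, regardless of input size.
            for (int i = 0; i < 1_000_000; i++) {
                digest.add(rng.nextGaussian());
            }

            // Estimated value below which 99% of the data falls.
            double cutoff = digest.quantile(0.99);
            System.out.printf("approximate 99th percentile: %.4f%n", cutoff);

            // A filter pass (or filtering on the fly against a running
            // estimate) then pulls out the records above the cutoff.
        }
    }

The cutoff carries a small, bounded rank error, which is exactly the
exactness-for-memory trade described above.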
