You don't need the full set in ram. You only need to keep the largest 100m
in ram, but you would need to keep it in a sorted data structure. Our ram
is tight you can keep keys only then extract the data in a second pass.
On Jan 22, 2014 4:36 AM, "Ted Dunning" <ted.dunning@gmail.com> wrote:
>
> On Tue, Jan 21, 2014 at 7:31 AM, <churylin@gmail.com> wrote:
>
>> You mentioned a approximate algorithm. That's great! I will check it out
>> later. But, Is there a way to calculate it in a precise way?
>
>
> If you want to select the 1% largest numbers, then you have a few choices.
>
> If you have memory for the full set, you can sort.
>
> If you have room to keep 1% of the samples in memory, you need to do 100
> passes.
>
> If you are willing to accept small errors, then you can do it in a single
> pass.
>
> These trade-offs are not optional, but are theorems.
>
>
>
|