Thanks a lot, everyone!
This is really helpful. I only expect to get the top k items out of an
nsized vector. The heap should be small enough to fit in memory and (for my
case) it might be an overkill to try to parallelize it. I will implement
this using the algorithms that were suggested here. I also think it would be
great if there was a more general class that could be instantiated with a
vector, but for my specific case it's probably not needed.
Again, thanks everyone. I'm surprised of the level of engagement of this
community. This is definitely a huge plus to our decision to go with Mahout.
Thanks,
Julian
2011/4/25 Ted Dunning <ted.dunning@gmail.com>
> The simplest implementation is O(n log(k)) (inserts every element, then
> drops at most one element).
>
> Caching the kth best element and only inserting new items has the same
> worst case behavior in ordered cases.
>
> There is a true O(n + k) algorithm based on quicksort (I think).
>
> On Mon, Apr 25, 2011 at 2:26 PM, Jake Mannix <jake.mannix@gmail.com>
> wrote:
>
> > On Mon, Apr 25, 2011 at 2:23 PM, Jake Mannix <jake.mannix@gmail.com>
> > wrote:
> > >
> > >
> > > This is O(n + k*log(k)) for a randomly sorted list.
> > >
> >
> > This is a bit sloppy: there's a multiplicative factor of log(n) also in
> the
> > second
> > additive factor, but that's still linear overall.
> >
> > jake
> >
>
