lucene-java-user mailing list archives

From "Stanislav Jordanov" <ste...@sirma.bg>
Subject Re: Fast access to a random page of the search results.
Date Wed, 02 Mar 2005 16:40:49 GMT
Thank you guys,

there's a good chance that I will have the management persuaded to drop the
'random access requirement'.
As you surely know, management (usually) tends to be frantically
optimistic.
True to this trend, our management suggested to us (the R&D team) that:
"... it is time to assume that we will have to modify the core Lucene engine
in order to achieve what we want."
Nearly everyone in our craft likes to think of himself (or herself) as a
fairly good programming geek.
But honestly speaking, I lack the confidence that I could all of a sudden
introduce a significant improvement in Lucene,
a far-from-trivial framework that has been evolving for the last 4 (or 5)
years.

Highly appreciating the work you've done:
Stenly

----- Original Message ----- 
From: "Doug Cutting" <cutting@apache.org>
To: "Lucene Users List" <lucene-user@jakarta.apache.org>
Sent: Wednesday, March 02, 2005 12:04 AM
Subject: Re: Fast access to a random page of the search results.


> Daniel Naber wrote:
> > After fixing this I can reproduce the problem with a local index that
> > contains about 220.000 documents (700MB). Fetching the first document
> > takes for example 30ms, fetching the last one takes >100ms. Of course I
> > tested this with a query that returns many results (about 50.000).
> > Actually it happens even with the default sorting, no need to sort by
> > some specific field.
>
> In part this is due to the fact that Hits first searches for the
> top-scoring 100 documents.  Then, if you ask for a hit after that, it
> must re-query.  In part this is also due to the fact that maintaining a
> queue of the top 50k hits is more expensive than maintaining a queue of
> the top 100 hits, so the second query is slower.  And in part this could
> be caused by other things, such as that the highest ranking document
> might tend to be cached and not require disk io.
>
> One could perform profiling to determine which is the largest factor.
> Of these, only the first is really fixable: if you know you'll need hit
> 50k then you could tell this to Hits and have it perform only a single
> query.  But the algorithmic cost of keeping the queue of the top 50k is
> the same as collecting all the hits and sorting them.  So, in part,
> getting hits 49,990 through 50,000 is inherently slower than getting
> hits 0-10.  We can minimize that, but not eliminate it.
>
> Doug
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>
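
Doug's point about queue size can be illustrated with a minimal sketch that
does not use Lucene at all. The class TopHitsPage, the Hit type and the page
method are hypothetical names chosen for this illustration; they are not the
Hits implementation. The sketch assumes every document matches and simply
keeps the best `to` scores in a bounded heap, so a page near hit 50,000 needs
a 50,000-entry queue while the first page needs only 10 entries.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Random;

public class TopHitsPage {

    /** Hypothetical scored hit: a document id plus its relevance score. */
    static final class Hit {
        final int doc;
        final float score;

        Hit(int doc, float score) {
            this.doc = doc;
            this.score = score;
        }
    }

    // Best first: higher score wins, ties are broken by the lower doc id.
    static final Comparator<Hit> BEST_FIRST = (a, b) -> {
        int c = Float.compare(b.score, a.score);
        return c != 0 ? c : Integer.compare(a.doc, b.doc);
    };

    /**
     * Returns hits [from, to) in best-first order. Only the best `to`
     * candidates are kept while scanning, so the work grows with `to`:
     * roughly n * log(to) comparisons for n scored documents.
     */
    static List<Hit> page(float[] scores, int from, int to) {
        // Worst-first heap: the weakest hit still kept sits at the head.
        PriorityQueue<Hit> queue =
                new PriorityQueue<>(Math.max(1, to), BEST_FIRST.reversed());
        for (int doc = 0; doc < scores.length; doc++) {
            Hit hit = new Hit(doc, scores[doc]);
            if (queue.size() < to) {
                queue.add(hit);
            } else if (to > 0 && BEST_FIRST.compare(hit, queue.peek()) < 0) {
                queue.poll();   // evict the weakest kept hit
                queue.add(hit);
            }
        }
        List<Hit> top = new ArrayList<>(queue);
        top.sort(BEST_FIRST);
        return new ArrayList<>(top.subList(Math.min(from, top.size()), top.size()));
    }

    public static void main(String[] args) {
        // Synthetic scores only; this does not model disk I/O or caching.
        Random random = new Random(42);
        float[] scores = new float[220_000];
        for (int i = 0; i < scores.length; i++) {
            scores[i] = random.nextFloat();
        }

        long t0 = System.nanoTime();
        List<Hit> firstPage = page(scores, 0, 10);
        long t1 = System.nanoTime();
        List<Hit> deepPage = page(scores, 49_990, 50_000);
        long t2 = System.nanoTime();

        System.out.printf("hits 0-10: %d results, %d us%n",
                firstPage.size(), (t1 - t0) / 1000);
        System.out.printf("hits 49,990-50,000: %d results, %d us%n",
                deepPage.size(), (t2 - t1) / 1000);
    }
}
```

As `to` approaches the number of matching documents, the bounded heap does
about as much work as collecting every hit and sorting, which is the
inherent cost Doug describes. Asking for the deep page in one call at least
avoids running the query twice.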



