lucene-java-user mailing list archives

From "Johan Stuyts" <j.stu...@hippo.nl>
Subject RE: Preventing merging by IndexWriter
Date Wed, 18 Oct 2006 08:36:46 GMT
> Why go through all this effort when it's easy to make your own unique ID?
> Add a new field to each document "myuniqueid" and fill it in yourself. It'll
> never change then.

I am sorry I did not mention in my post that I am aware of this solution
but that it cannot be used for my purposes. I need the stable ID during
filtering and afterwards for counting for faceted browsing. My tests
show, and from the documentation/mailing lists I conclude, that
retrieving a stable ID from a field during filtering and for each hit in
a query result is too expensive.

After your post I put some more thought into storing a stable ID in a
field. I figured I could read all the stable IDs once and create a map
from Lucene IDs to stable IDs. But this takes too long (> 600 ms on my
laptop) for a small set of documents (< 150,000). Another problem is
that I have to do millions of additional lookups during filtering and
counting.
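
A minimal sketch of the lookup table described above, to make the cost concrete. The class name StableIdMap and the storedIds array are hypothetical: storedIds[i] stands in for the "myuniqueid" field value of Lucene document i, however it is read from the index, and the stable IDs are assumed to be small non-negative integers. Reading those field values is the slow step measured above; the extra array lookup per document is what would be repeated millions of times during filtering and counting.

```java
import java.util.BitSet;

// Hypothetical helper, not part of Lucene: maps Lucene document
// numbers to stable IDs through a plain int array.
final class StableIdMap {
    private final int[] stableIdByDocNum;

    // storedIds[i] is assumed to hold the stable ID stored in a field
    // of Lucene document i, already read from the index.
    StableIdMap(String[] storedIds) {
        stableIdByDocNum = new int[storedIds.length];
        for (int docNum = 0; docNum < storedIds.length; docNum++) {
            stableIdByDocNum[docNum] = Integer.parseInt(storedIds[docNum]);
        }
    }

    // One array read per lookup.
    int stableId(int docNum) {
        return stableIdByDocNum[docNum];
    }

    // Translates a set of matching Lucene document numbers into the
    // corresponding set of stable IDs.
    BitSet toStableIds(BitSet docNums) {
        BitSet result = new BitSet();
        for (int d = docNums.nextSetBit(0); d >= 0; d = docNums.nextSetBit(d + 1)) {
            result.set(stableIdByDocNum[d]);
        }
        return result;
    }
}
```

The table has to be rebuilt whenever Lucene renumbers documents, which is why the build time matters here.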

> Of course, I may misunderstand your problem space enough that this is
> useless. If so, please tell us the problem you're trying to solve and maybe
> wiser heads than mine will have better suggestions....

Here is a description of our problem. We want to build a repository that
can handle a number of documents that is in the low millions (we are
designing the repository for 10 million documents initially). Almost all
navigation through this repository will be faceted. For this we need to
be able to filter based on the facet values selected by the user, and we
have to count how many documents in the search result have a particular
facet value for multiple (estimate: 25-40) facet values. The documents
in the repository are constantly changed and we want the faceted
navigation to be updated in near real-time: if the user refreshes a
search page after making changes to a document, the changes should be
visible. I estimate we have about 250-500 ms, the time it takes to go to
another page and refresh it, to update the index(es).

My idea is to use Lucene for regular searching, and use a custom index
for filtering based on facets and for counting the number of matches for
facet values. For this to work, (reasonably) stable IDs are needed so
that updating the facet value index is simply a matter of changing
values in a number of arrays.
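
A rough sketch of what such an array-based facet value index could look like. All of it is an assumption made for illustration: the class name FacetValueIndex, one facet per index instance, at most one value per document for that facet, facet values encoded as small integers, and search hits supplied as a BitSet of stable IDs.

```java
import java.util.Arrays;
import java.util.BitSet;

// Hypothetical facet value index for a single facet, keyed by stable ID.
final class FacetValueIndex {
    static final int NO_VALUE = -1;

    // valueByStableId[id] is the facet value of the document with that
    // stable ID, or NO_VALUE if the document has no value for this facet.
    private final int[] valueByStableId;
    private final int valueCount;

    FacetValueIndex(int capacity, int valueCount) {
        this.valueByStableId = new int[capacity];
        Arrays.fill(this.valueByStableId, NO_VALUE);
        this.valueCount = valueCount;
    }

    // Updating a document is a single array write.
    void setValue(int stableId, int value) {
        valueByStableId[stableId] = value;
    }

    // Keeps only the hits that have the selected facet value.
    BitSet filter(BitSet hits, int value) {
        BitSet result = new BitSet(valueByStableId.length);
        for (int id = hits.nextSetBit(0); id >= 0; id = hits.nextSetBit(id + 1)) {
            if (valueByStableId[id] == value) {
                result.set(id);
            }
        }
        return result;
    }

    // Counts, for every facet value, how many hits carry it.
    int[] countValues(BitSet hits) {
        int[] counts = new int[valueCount];
        for (int id = hits.nextSetBit(0); id >= 0; id = hits.nextSetBit(id + 1)) {
            int value = valueByStableId[id];
            if (value != NO_VALUE) {
                counts[value]++;
            }
        }
        return counts;
    }
}
```

This only works if a document keeps the same ID for as long as the arrays live; if IDs shift, every array has to be rebuilt.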

I am willing to sacrifice search performance for stable IDs if it gains
performance in faceted filtering and counting.

Johan



