From: Doug Cutting
To: "Lucene Users List" <lucene-user@jakarta.apache.org>
Subject: RE: using lucene with a very large index
Date: Thu, 14 Feb 2002 09:12:27 -0800

> From: tal blum [mailto:thetalthe@hotmail.com]
>
> 2) Does the Document id change after merging indexes, adding,
> or deleting documents?

Yes.

> 4) Assuming I have a term query that has a large number of
> hits, say 10 million, is there a way to get, say, the top
> 10 results without going through all the hits?

Your best bet is to use the normal search API.

> From: tal blum [mailto:thetalthe@hotmail.com]
>
> One solution to that is to change the implementation and
> store the docs sorted by their term score.

That would make incremental index updates much slower, since every time a
document is added, the list of documents containing each term in that
document would need to be re-sorted.
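To expand on "the normal search API": the search still scans every posting for the term, but it only retains the top N hits in a small bounded priority queue, so memory stays constant even with 10 million hits. A rough self-contained sketch of that idea in plain Java (the class and method names here are invented for illustration; this is not Lucene's actual hit-collection code):

```java
import java.util.PriorityQueue;

public class TopDocsSketch {
    // Keep only the top `n` scores seen while scanning a long stream of hits.
    public static double[] topN(double[] scores, int n) {
        // Min-heap of size <= n: the weakest retained score sits on top.
        PriorityQueue<Double> heap = new PriorityQueue<>();
        for (double s : scores) {
            if (heap.size() < n) {
                heap.add(s);
            } else if (s > heap.peek()) {
                heap.poll();   // evict the current weakest hit
                heap.add(s);
            }
        }
        // Drain the heap into descending order for presentation.
        double[] top = new double[heap.size()];
        for (int i = top.length - 1; i >= 0; i--) {
            top[i] = heap.poll();
        }
        return top;
    }
}
```

So asking for the top 10 out of 10 million hits costs one pass over the postings plus a heap of only 10 entries; it is the pass itself, not the result size, that dominates.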
Currently we only need to append new entries, which is much simpler. You could
optimize this in various ways (e.g., instead take the hit at search time), but
it would still make things slower for rapidly changing indexes.

Also, while this would make single-term queries faster, multi-term queries are
more complex to accelerate. The highest scoring match for a two-term query may
be in a document where one term has a very high weight and the other has a
very low weight. There have been papers written (I don't have the references
handy) exploring this issue, and, in general, there isn't an algorithm that is
guaranteed to return the highest scoring documents for multi-term queries that
does not, in most cases, have to process nearly all of the documents
containing those terms.

That said, it is possible to use such an index to vastly accelerate searches
that *usually* return the highest scoring documents. Such a heuristic search
technique is among the things required to scale Lucene to extremely large
collections (e.g., hundreds of millions of documents).

There are also lower-tech optimizations. For example, one can simply keep a
small index containing the highest-quality documents that is always searched
first. If enough hits are found there, you're done.

A real internet search engine combines lots of tricks in order to scale:
segmenting indexes by quality, heuristic search methods, and distributed
searching. Deploying something like Google is not a small task.

I would someday like to add a heuristic search component to Lucene that uses a
special index format (possibly with term document lists sorted by normalized
frequency, as you suggest). I have some experience doing this at Excite, and
it pays off big time. But it would take me several weeks full-time to
implement this, and I don't currently have that time. Perhaps (with the
support of an interested sponsor) I could make time this summer to implement
this.
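The small-index-first trick can be sketched in a few lines. The `Index` interface and the tier names below are invented for illustration; this is not a Lucene API, just the shape of the fallback logic:

```java
import java.util.List;

public class TieredSearch {
    /** A stand-in for any searchable index; not a Lucene interface. */
    public interface Index {
        List<String> search(String query, int wanted);
    }

    /**
     * Search the small high-quality tier first; only fall back to the
     * much larger (and slower) full index when the small tier cannot
     * fill the request.
     */
    public static List<String> search(Index qualityTier, Index fullIndex,
                                      String query, int wanted) {
        List<String> hits = qualityTier.search(query, wanted);
        if (hits.size() >= wanted) {
            return hits;          // enough good hits: skip the big index
        }
        return fullIndex.search(query, wanted);
    }
}
```

The win comes from how queries are distributed in practice: if most queries are satisfied by the quality tier, the expensive full-index scan is the rare case rather than the common one.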
In the meantime, if you encounter performance problems with a very large
index, you might try segmenting your index by document quality and/or
distributed search.

Doug