lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Karl Wettin <karl.wet...@gmail.com>
Subject Re: lucene suggest
Date Wed, 22 Aug 2007 12:32:22 GMT

21 aug 2007 kl. 13.10 skrev Jens Grivolla:

> On 8/21/07, Heba Farouk <heba.farouk@yahoo.com> wrote:
>> the documents are not duplicated, i mean the hits (assume that 2  
>> documents have the same subject but with different authors, so if  
>> i'm searching the subject, the returned hits will have duplicates )
>> i was asking if i can remove duplicates from the hits??
>
> You may not want to work with documents at all (where you have the
> duplicates), but rather with the terms in your index directly.  Take a
> look at WildcardTermEnum etc.

My favorite solution for this is a stand alone trie, and such a  
solution is available in LUCENE-625.

Another way is to create an ngram-index.

It is usually a good idea to create an "a priori" corpus with a  
limited set of data. I prefere common user queries rather than items  
in the index. Especially if your corpus is large and you have a lot  
of server load.

Try LUCENE-550 as a priori index. My guess is that it would  
outperform a RAMDirectory 20x at 25,000 title-sized (40 chars avg)  
documents.


-- 
karl

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message