lucene-java-user mailing list archives

From Ahmet Arslan <iori...@yahoo.com>
Subject Re: Indexing and search questions
Date Tue, 20 Apr 2010 21:27:23 GMT

> I'd like to use Lucene to search text documents for the existence of
> a large list of search terms. I have a file that contains thousands
> of entries, one word per line. I was thinking about writing a
> specialized analyzer that tokenizes the document by looking up each
> token in the source document in my list of words and returning terms
> for words that exist in my file. I'm hoping that using this approach
> the index file will contain only items that exist in my document.

Sounds like KeepWordFilter [1][2] is what you are looking for. Your file with
thousands of entries, one word per line, would serve as keepwords.txt.
And just as you guessed, with this approach the index will contain
only terms that appear in that word list (keepwords.txt).

I can share the code for using this TokenFilter in Lucene if you want. Alternatively, you can
simply copy KeepWordFilter.java into your own code.
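
To illustrate the idea only, here is a minimal sketch in plain Java of what a keep-word filter does: load the word list into a set, tokenize the text, and pass through only the tokens found in the set. This is not the Solr KeepWordFilter source and it uses no Lucene classes. The class name KeepWordsSketch, the method names, the lowercasing, and the simple split-on-non-alphanumerics tokenization are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration; not part of Lucene or Solr.
class KeepWordsSketch {

    // Build the keep set from the lines of a word list (one word per line).
    // Lowercasing is an assumption, so that matching ignores case.
    static Set<String> loadKeepWords(List<String> lines) {
        Set<String> keep = new HashSet<>();
        for (String line : lines) {
            String word = line.trim().toLowerCase(Locale.ROOT);
            if (!word.isEmpty()) {
                keep.add(word);
            }
        }
        return keep;
    }

    // Split the text on runs of non-letter, non-digit characters and
    // return only the tokens that are present in the keep set.
    static List<String> filter(String text, Set<String> keep) {
        List<String> kept = new ArrayList<>();
        String[] tokens = text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
        for (String token : tokens) {
            if (keep.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // In practice the lines would be read from keepwords.txt.
        Set<String> keep = loadKeepWords(Arrays.asList("lucene", "index"));
        System.out.println(filter("Lucene builds an index; the index is small.", keep));
        // prints [lucene, index, index]
    }
}
```

In Lucene the same check would sit inside a TokenFilter, so that only the kept terms ever reach the index.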

[1]http://lucene.apache.org/solr/api/org/apache/solr/analysis/KeepWordFilter.html

[2]http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.KeepWordFilterFactory




      
