lucene-java-user mailing list archives

From "Armbrust, Daniel C." <Armbrust.Dan...@mayo.edu>
Subject RE: build a case insensitive index
Date Fri, 12 Dec 2003 15:56:07 GMT
I believe that if you add an identical document twice, a search will return it
twice.  If you don't want duplicate results, I think you will need to keep a HashSet of the
terms you have already indexed, and not add the document if the lowercase values are equal
(or something along those lines).
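
A rough sketch of that idea in plain Java, without any Lucene calls. The class and method names (LowercaseDedup, dropCaseDuplicates) are made up for illustration, and it assumes the values to compare are available as strings before anything is added to the index:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Lucene: keeps only the first entry
// for each lowercased value and skips later case variants.
class LowercaseDedup {
    static List<String> dropCaseDuplicates(List<String> values) {
        Set<String> seen = new HashSet<>();
        List<String> kept = new ArrayList<>();
        for (String value : values) {
            // Locale.ROOT avoids locale-specific surprises (e.g. Turkish dotless i)
            String key = value.toLowerCase(Locale.ROOT);
            // Set.add returns false when the key is already present
            if (seen.add(key)) {
                kept.add(value);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("Term", "term", "TERM", "index");
        System.out.println(dropCaseDuplicates(input)); // prints [Term, index]
    }
}
```

In a real indexer the check would sit just before the call that adds the document, so an entry whose lowercased key is already in the set is simply not added.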

Dan

-----Original Message-----
From: Thomas Krämer [mailto:kraemert@smail.uni-koeln.de] 
Sent: Thursday, December 11, 2003 3:01 PM
To: Lucene Users List
Subject: build a case insensitive index


Hello Lucene Users

I need a document-term matrix to initialize a neural network that I 
want to use to integrate user feedback into the retrieval process.

Until now, I have been using a slightly modified version of the IndexHTML example class.

How can I create an index of all the terms in a collection without 
"term" and "Term" being indexed twice?
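
To make the goal concrete, here is a minimal sketch of a case-insensitive document-term matrix in plain Java. It does not use Lucene at all: the tokenizer is a simple split on non-letter, non-digit characters, there is no stop-word removal, and the names (DocTermMatrix, tokenize, vocabulary, matrix) are hypothetical. The point is only that every token is lowercased before it reaches the vocabulary, so "term" and "Term" share one column:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical sketch, not Lucene code: builds a document-term count
// matrix in which case variants of a term share a single column.
class DocTermMatrix {
    // Split on anything that is not a letter or digit, then lowercase.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.split("[^\\p{L}\\p{Nd}]+")) {
            if (!raw.isEmpty()) {
                tokens.add(raw.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // Sorted list of the distinct lowercased terms across all documents.
    static List<String> vocabulary(List<String> docs) {
        TreeSet<String> terms = new TreeSet<>();
        for (String doc : docs) {
            terms.addAll(tokenize(doc));
        }
        return new ArrayList<>(terms);
    }

    // counts[d][t] = occurrences of vocab term t in document d.
    static int[][] matrix(List<String> docs, List<String> vocab) {
        Map<String, Integer> column = new HashMap<>();
        for (int t = 0; t < vocab.size(); t++) {
            column.put(vocab.get(t), t);
        }
        int[][] counts = new int[docs.size()][vocab.size()];
        for (int d = 0; d < docs.size(); d++) {
            for (String token : tokenize(docs.get(d))) {
                Integer t = column.get(token);
                if (t != null) {
                    counts[d][t]++;
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
            "Term weighting and term matching",
            "A Term is indexed");
        List<String> vocab = vocabulary(docs);
        System.out.println(vocab);
        for (int[] row : matrix(docs, vocab)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

With the two sample documents above, "Term" and "term" are counted in the same column, giving 2 for the first document and 1 for the second.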

In the example, a StandardAnalyzer is used, and the documentation 
says:


Filters StandardTokenizer with StandardFilter, LowerCaseFilter and 
StopFilter.

So why do I get duplicate entries for terms written in upper and lower case?


Regards.

Thomas


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org
