lucene-java-user mailing list archives

From Saurabh Gokhale <saurabhgokh...@gmail.com>
Subject Question on the use of Synonym Filter while searching using MoreLikeThis
Date Mon, 09 May 2011 15:10:56 GMT
Hi All,

This is my first question on this list. I am fairly familiar with Lucene
and am using 2.9.4 in my project (not Solr). I have the following question
about the use of a synonym filter.



While indexing content, I use the following analyzer setup:

[Analyzer1] == StandardTokenizer -->  StandardFilter --> LowerCaseFilter
--> StopFilter --> PorterStemFilter

While searching with MoreLikeThis, I use an analyzer similar to the
previous one, but with a synonym filter added:

[Analyzer2] == StandardTokenizer -->  StandardFilter --> LowerCaseFilter
--> StopFilter --> SynonymFilter --> PorterStemFilter
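
To make the difference between the two chains concrete, here is a toy sketch in plain Java. It does not use Lucene's TokenStream classes; the stop set, the synonym map and the analyze helper are hypothetical stand-ins, and stemming is left out. It only shows that the chain with a synonym step emits extra terms that the chain without it never produces.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

public class AnalyzerChainSketch {
    // Hypothetical stop words and synonyms, for illustration only.
    static final Set<String> STOP =
            new HashSet<String>(Arrays.asList("the", "a", "is"));
    static final Map<String, List<String>> SYNONYMS =
            new HashMap<String, List<String>>();
    static {
        SYNONYMS.put("car", Arrays.asList("auto", "automobile"));
    }

    // Toy chain: tokenize -> lowercase -> drop stop words -> optional synonyms.
    // Stemming is omitted to keep the sketch short.
    static List<String> analyze(String text, boolean withSynonyms) {
        List<String> out = new ArrayList<String>();
        for (String raw : text.split("\\W+")) {
            if (raw.isEmpty()) {
                continue;
            }
            String tok = raw.toLowerCase(Locale.ROOT);
            if (STOP.contains(tok)) {
                continue;
            }
            out.add(tok);
            if (withSynonyms && SYNONYMS.containsKey(tok)) {
                out.addAll(SYNONYMS.get(tok));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "The car is fast";
        System.out.println(analyze(text, false)); // [car, fast]
        System.out.println(analyze(text, true));  // [car, auto, automobile, fast]
    }
}
```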


*Scenario 1: Analyzer 1 for indexing and searching*
I index documents A, B and C using Analyzer1 and then run MoreLikeThis on
document D, also with Analyzer1 (not Analyzer2), to find similar documents
in the index. I get the following output:
A matched 40%
B matched 20%
C matched 5%

*Scenario 2: Analyzer 2 for indexing and searching*
As soon as I use Analyzer2 (with the synonym filter) both to index and to
search for documents similar to document D, all my results get a boost and
become:
A matched 60%
B matched 40%
C matched 25%

*Scenario 3: Analyzer 1 for indexing and Analyzer 2 for searching*
But if I use Analyzer1 for indexing and Analyzer2 for searching, my
results drop sharply:
A matched 15%
B matched 11%
C matched 2%

When I dug into why the match percentages went down, I found that
searching with the synonym analyzer yields many more interesting terms
[moreLikeThis.retrieveInterestingTerms(reader)]. Most of these synonym
terms then match all of the documents, which lowers their tf and idf and
results in lower match percentages for the documents.
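
As a rough illustration of the idf part of this reasoning, the sketch below evaluates the idf formula used by Lucene's DefaultSimilarity, 1 + ln(numDocs / (docFreq + 1)), for a three-document index like A, B and C. The document frequencies are made-up values, and the class and method names are hypothetical; this is not how MoreLikeThis computes its final percentages, only a way to see that a term present in every document carries far less weight than a rare one.

```java
public class IdfSketch {
    // Same shape as DefaultSimilarity.idf: 1 + ln(numDocs / (docFreq + 1)).
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    public static void main(String[] args) {
        int numDocs = 3; // documents A, B and C

        // Assumed document frequencies, for illustration only.
        double rareTerm = idf(1, numDocs);   // term found in one document
        double commonTerm = idf(3, numDocs); // term found in every document

        System.out.println("idf of rare term:   " + rareTerm);
        System.out.println("idf of common term: " + commonTerm);
    }
}
```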



*So my questions are:*
1. Is it correct to use an analyzer without a synonym filter for indexing
and one with a synonym filter for searching?
2. Is there any other setting I am missing that causes all the match
percentages to go down?

My MoreLikeThis search settings are:

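// "index" is the IndexReader opened on the index being searched
MoreLikeThis mlt = new MoreLikeThis(index);
SynonymEngine engine = new WordNetSynonymEngine(new File("PATH"));

mlt.setMinWordLen(3);      // ignore words shorter than 3 characters
mlt.setBoost(true);        // boost query terms by their interesting-term score
mlt.setMinTermFreq(2);     // a term must occur at least twice in the source document
mlt.setMinDocFreq(0);      // do not filter out terms by document frequency
mlt.setMaxQueryTerms(100); // cap the generated query at 100 terms
mlt.setAnalyzer(new PorterSynonymStandardAnalyzer(engine)); // Analyzer2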



Thanks in advance

Saurabh
