lucene-java-user mailing list archives

From: Peter Lavin
Subject: Differences in MLT Query Terms Question
Date: Tue, 08 Jan 2013 18:46:58 GMT

Dear Users,

I am running some simple experiments with Lucene and am seeing something 
I don't understand.

I have 16 text files on 4 different topics, ranging in size from 50-900 
KB.  When I index all 16 of these and run an MLT query based on one of 
the indexed documents, I get an expected result (i.e. similar topics are 

When I reduce the number of text files to 4 and index them (having taken 
care to overwrite the previous index files), and then run the same MLT 
query (based on the same document from the index), I get slightly 
different scores. I'm assuming this is because the IDF is now different 
because there are fewer documents.
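
To illustrate that assumption, here is a minimal sketch of how IDF could 
shift with index size. It assumes the classic formula 
idf = 1 + ln(numDocs / (docFreq + 1)) used by Lucene's DefaultSimilarity; 
the class name, method name and the document frequency of 3 are 
hypothetical and are not taken from my actual index or from the Lucene API.

```java
public class IdfSketch {

    // Assumed classic Lucene-style IDF: 1 + ln(numDocs / (docFreq + 1)).
    // This is a standalone re-implementation for illustration only.
    public static double idf(long docFreq, long numDocs) {
        return 1.0 + Math.log((double) numDocs / (double) (docFreq + 1));
    }

    public static void main(String[] args) {
        long docFreq = 3; // hypothetical: the term occurs in 3 documents

        // Same term, same document frequency, two index sizes.
        System.out.println("16-document index: " + idf(docFreq, 16));
        System.out.println(" 4-document index: " + idf(docFreq, 4));
    }
}
```

Under this formula a term with the same document frequency gets a lower 
IDF in the smaller index, which would be consistent with the scores 
changing even though the query document is the same.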

For each run, I have set the max number of terms as...

However, when I compare the terms which get used for the MLT query on 
the 16 document index and the 4 document index, they are slightly 
different. I've printed, parsed and sorted them into two columns of a 
CSV file. I've pasted a small part of it at the end of this email.

My Question(s)...
1) Can anybody explain why the set of terms used for the MLT query 
differs depending on whether the source document comes from the 
16-document index or the 4-document index?

2) Am I right in assuming that the reason for the slightly different 
scores is the IDF, or could it be this slight difference in the sets of 
terms used (or possibly both)?


with best regards,
Peter Lavin,
PhD Candidate,
CAG - Computer Architecture & Grid Research Group,
Lloyd Institute, 005,
Trinity College Dublin, Ireland.
+353 1 8961536

