lucene-java-user mailing list archives

From Peter Lavin
Subject Re: Differences in MLT Query Terms Question
Date Wed, 09 Jan 2013 12:06:40 GMT

Hi Jack, thanks for your ideas. I've added some comments below your
questions; maybe you can shed some more light on this.

On 01/08/2013 11:34 PM, Jack Krupansky wrote:
> The term "arv" is on the first list, but not the second. Maybe its
> document frequency fell below the setting for minimum document frequency
> on the second run.
> Or, maybe the minimum word length was set to 4 or more on the second run.
The same parameters (the same code) are used for each run. All that
changes is the path, which points to a different folder: one contains
16 files, the other 4. The smaller folder was made by simply deleting
the 12 unwanted files.
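
To make Jack's first suggestion concrete: even with identical parameters, deleting files lowers the document frequency of some terms, so a fixed minimum-document-frequency threshold can exclude a term in the smaller index that it admitted in the larger one. The following is only a minimal sketch of that effect in plain Java. The class `DocFreqFilterSketch`, its methods, and the toy documents are hypothetical and are not part of Lucene; the real filtering inside MoreLikeThis involves further settings (term frequency, word length, and so on).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper, NOT Lucene code: mimics a minimum-document-frequency filter.
final class DocFreqFilterSketch {

    /** Builds a toy "document" as the set of distinct terms it contains. */
    static Set<String> doc(String... terms) {
        return new HashSet<>(Arrays.asList(terms));
    }

    /** Counts, for each term, the number of documents that contain it. */
    static Map<String, Integer> docFreqs(List<Set<String>> docs) {
        Map<String, Integer> df = new TreeMap<>();
        for (Set<String> d : docs) {
            for (String term : d) {
                df.merge(term, 1, Integer::sum);
            }
        }
        return df;
    }

    /** Keeps only the terms that occur in at least minDocFreq documents. */
    static Set<String> eligibleTerms(List<Set<String>> docs, int minDocFreq) {
        Set<String> kept = new TreeSet<>();
        for (Map.Entry<String, Integer> e : docFreqs(docs).entrySet()) {
            if (e.getValue() >= minDocFreq) {
                kept.add(e.getKey());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Toy data (an assumption for illustration): four documents in the larger index.
        List<Set<String>> larger = new ArrayList<>();
        larger.add(doc("arv", "grid"));
        larger.add(doc("grid", "cloud"));
        larger.add(doc("arv", "cloud"));
        larger.add(doc("cloud"));

        // The smaller index keeps only the first two documents, as if files were deleted.
        List<Set<String>> smaller = new ArrayList<>(larger.subList(0, 2));

        int minDocFreq = 2; // same threshold for both runs
        System.out.println("larger index:  " + eligibleTerms(larger, minDocFreq));
        System.out.println("smaller index: " + eligibleTerms(smaller, minDocFreq));
    }
}
```

In this toy data "arv" occurs in two documents of the larger index but in only one document of the smaller one, so the same threshold treats it differently in the two runs.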

> Are you using MoreLikeThisQuery or directly using MoreLikeThis?
I'm using MoreLikeThis directly. Does this make a difference?

> Or, possibly "arv" appears later in a document on the second run, after
> the number of tokens specified by maxNumTokensParsed.
The files used in the second run are identical to those in the first,
and each file is read from disk and indexed individually (as is common,
I'm sure). I have checked this: when all 16 files are indexed together,
the results are identical on every run, and the same holds for the
4-file runs. That is, the outcomes for both the 16-file and the 4-file
index can be reproduced.

The reason for my question (and for doing these runs) is that I'm using 
Lucene in an application where I want to use the similarity measurements 
between documents as a metric in another area. If the similarity score 
changes when the size of the index changes, I need to understand why.
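
As a rough illustration of how index size alone can change both the scores and the set of selected terms, here is a minimal sketch in plain Java. It assumes the classic TF-IDF form idf = 1 + ln(numDocs / (docFreq + 1)) and ranks terms by tf * idf before cutting the list off at a maximum number of query terms. The class `MltTermRankingSketch`, its methods, and the term statistics are hypothetical illustration values, not Lucene code and not measurements from the runs described here.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, NOT Lucene code: ranks terms by tf * idf for a given index size.
final class MltTermRankingSketch {

    /** Assumed classic inverse document frequency: 1 + ln(numDocs / (docFreq + 1)). */
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    /** Returns up to maxQueryTerms terms with the highest tf * idf, best first. */
    static List<String> topTerms(Map<String, Integer> termFreqs,
                                 Map<String, Integer> docFreqs,
                                 int numDocs,
                                 int maxQueryTerms) {
        List<String> terms = new ArrayList<>(termFreqs.keySet());
        // Sort in descending order of tf * idf.
        terms.sort((a, b) -> Double.compare(
                termFreqs.get(b) * idf(docFreqs.get(b), numDocs),
                termFreqs.get(a) * idf(docFreqs.get(a), numDocs)));
        return new ArrayList<>(terms.subList(0, Math.min(maxQueryTerms, terms.size())));
    }

    public static void main(String[] args) {
        // Term frequencies in the source document (made-up values).
        Map<String, Integer> tf = new HashMap<>();
        tf.put("arv", 10);
        tf.put("grid", 12);

        // Document frequencies in a 16-document index (made-up values).
        Map<String, Integer> dfLarge = new HashMap<>();
        dfLarge.put("arv", 2);
        dfLarge.put("grid", 8);

        // Document frequencies after shrinking the index to 4 documents (made-up values).
        Map<String, Integer> dfSmall = new HashMap<>();
        dfSmall.put("arv", 2);
        dfSmall.put("grid", 2);

        System.out.println("16 docs, top term: " + topTerms(tf, dfLarge, 16, 1));
        System.out.println(" 4 docs, top term: " + topTerms(tf, dfSmall, 4, 1));
    }
}
```

With these made-up numbers the rare term wins in the 16-document index, while in the 4-document index both terms have the same IDF and the one with the higher term frequency wins. A cut-off at a fixed number of query terms can therefore select different terms for the same source document, and the IDF of a term that stays in the list also changes.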

thanks again,

> -- Jack Krupansky
> -----Original Message----- From: Peter Lavin
> Sent: Tuesday, January 08, 2013 1:46 PM
> To:
> Subject: Differences in MLT Query Terms Question
> Dear Users,
> I am running some simple experiments with Lucene and am seeing something
> I don't understand.
> I have 16 text files on 4 different topics, ranging in size from 50-900
> KB. When I index all 16 of these and run an MLT query based on one of
> the indexed documents, I get an expected result (i.e. similar topics are
> found).
> When I reduce the number of text files to 4 and index them (having taken
> care to overwrite the previous index files), and then run the same MLT
> query (based on the same document from the index), I get slightly
> different scores. I'm assuming this is because the IDF is now different
> because there are fewer documents.
> For each run, I have set the max number of terms as...
> mlt.setMaxQueryTerms(100)
> However, when I compare the terms which get used for the MLT query on
> the 16 document index and the 4 document index, they are slightly
> different. I've printed, parsed and sorted them into two columns of a
> CSV file. I've pasted a small part of it at the end of this email.
> My Question(s)...
> 1) Can anybody explain why the set of terms used for the MLT query is
> different when a file from an index of 16 documents versus 4 documents
> is used?
> 2) Am I right in assuming that the reason for the slightly different scores
> is the IDF, or could it be this slight difference in the sets of terms
> used (or possibly both)?
> regards,
> Peter

with best regards,
Peter Lavin,
PhD Candidate,
CAG - Computer Architecture & Grid Research Group,
Lloyd Institute, 005,
Trinity College Dublin, Ireland.
+353 1 8961536
