lucene-java-user mailing list archives

From "Jack Krupansky" <j...@basetechnology.com>
Subject Re: Differences in MLT Query Terms Question
Date Tue, 08 Jan 2013 23:34:53 GMT
The term "arv" is on the first list, but not the second. Maybe its document 
frequency fell below the setting for minimum document frequency on the 
second run.

Or, maybe the minimum word length was set to 4 or more on the second run.

Are you using MoreLikeThisQuery or directly using MoreLikeThis?

Or, possibly "arv" appears later in a document on the second run, after the 
number of tokens specified by maxNumTokensParsed.
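
To make the three possibilities above concrete, here is a minimal plain-Java sketch of how such filters could drop a term like "arv". It is a simplified stand-in, not Lucene's code: the class MltFilterSketch, the method selectTerms and all document frequencies are hypothetical, and the parameter names only mirror the MoreLikeThis settings being discussed (minimum word length, minimum document frequency, maxNumTokensParsed).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Simplified stand-in for the term filters discussed above; NOT Lucene's implementation.
class MltFilterSketch {

    // Returns the terms that survive the filters, in first-seen order.
    // docFreqs maps a term to the number of indexed documents that contain it.
    static List<String> selectTerms(List<String> tokens, Map<String, Integer> docFreqs,
                                    int minWordLen, int minDocFreq, int maxNumTokensParsed) {
        // Only the first maxNumTokensParsed tokens of the source document are examined.
        Set<String> seen = new LinkedHashSet<>();
        int limit = Math.min(tokens.size(), maxNumTokensParsed);
        for (int i = 0; i < limit; i++) {
            seen.add(tokens.get(i));
        }
        List<String> kept = new ArrayList<>();
        for (String term : seen) {
            // Drop terms shorter than the minimum word length (0 disables the check).
            if (minWordLen > 0 && term.length() < minWordLen) continue;
            // Drop terms that occur in too few indexed documents.
            if (docFreqs.getOrDefault(term, 0) < minDocFreq) continue;
            kept.add(term);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("aids", "arv", "care");

        // Hypothetical document frequencies for a larger and a smaller index.
        Map<String, Integer> largeIndex = new HashMap<>();
        largeIndex.put("aids", 4); largeIndex.put("arv", 2); largeIndex.put("care", 3);
        Map<String, Integer> smallIndex = new HashMap<>();
        smallIndex.put("aids", 2); smallIndex.put("arv", 1); smallIndex.put("care", 2);

        // Same settings, different index: "arv" falls below minDocFreq in the small index.
        System.out.println(selectTerms(tokens, largeIndex, 0, 2, 5000));
        System.out.println(selectTerms(tokens, smallIndex, 0, 2, 5000));
        // Minimum word length of 4 drops the three-letter "arv".
        System.out.println(selectTerms(tokens, largeIndex, 4, 2, 5000));
        // "arv" appears after the token limit, so it is never seen.
        System.out.println(selectTerms(Arrays.asList("aids", "care", "arv"), largeIndex, 0, 2, 2));
    }
}
```

Only the first case depends on the index contents; the other two depend on the settings and on the token order of the source document, so checking which settings differed between the two runs should narrow it down.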

-- Jack Krupansky

-----Original Message----- 
From: Peter Lavin
Sent: Tuesday, January 08, 2013 1:46 PM
To: java-user@lucene.apache.org
Subject: Differences in MLT Query Terms Question


Dear Users,

I am running some simple experiments with Lucene and am seeing something
I don't understand.

I have 16 text files on 4 different topics, ranging in size from 50-900
KB.  When I index all 16 of these and run an MLT query based on one of
the indexed documents, I get an expected result (i.e. similar topics are
found).

When I reduce the number of text files to 4 and index them (having taken
care to overwrite the previous index files), and then run the same MLT
query (based on the same document from the index), I get slightly
different scores. I'm assuming this is because the IDF is now different
because there are fewer documents.
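
As a rough illustration of that assumption, the sketch below evaluates the classic TF-IDF inverse document frequency, idf = 1 + ln(numDocs / (docFreq + 1)), for a 16-document and a 4-document index. It assumes the classic TF-IDF similarity is in use; the class name IdfSketch and the docFreq value are hypothetical.

```java
// Sketch of the classic TF-IDF idf formula, assuming idf = 1 + ln(numDocs / (docFreq + 1)).
class IdfSketch {

    // docFreq: number of documents containing the term; numDocs: documents in the index.
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log(numDocs / (double) (docFreq + 1));
    }

    public static void main(String[] args) {
        int docFreq = 1; // hypothetical: the term occurs in a single document
        System.out.printf("16 docs: idf = %.3f%n", idf(docFreq, 16));
        System.out.printf(" 4 docs: idf = %.3f%n", idf(docFreq, 4));
    }
}
```

With the same docFreq, the smaller index gives a smaller idf, so a score difference would be expected even if the query terms were identical.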

For each run, I have set the max number of terms as...
mlt.setMaxQueryTerms(100)

However, when I compare the terms which get used for the MLT query on
the 16 document index and the 4 document index, they are slightly
different. I've printed, parsed and sorted them into two columns of a
CSV file. I've pasted a small part of it at the end of this email.
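
The same comparison can also be done as a set difference rather than by eye. The following is a minimal sketch; the class TermDiffSketch and the method onlyIn are hypothetical names, and the two lists are just the first few rows of the pasted excerpt.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

// Sketch: report which query terms appear in one run but not in the other.
class TermDiffSketch {

    // Returns the terms present in 'a' but missing from 'b', sorted alphabetically.
    static SortedSet<String> onlyIn(Collection<String> a, Collection<String> b) {
        SortedSet<String> diff = new TreeSet<>(a);
        diff.removeAll(b);
        return diff;
    }

    public static void main(String[] args) {
        // First rows of each CSV column from the excerpt at the end of this email.
        List<String> firstColumn = Arrays.asList(
                "about", "affordable", "agents", "aids", "architecture",
                "arv", "based", "blog", "board");
        List<String> secondColumn = Arrays.asList(
                "about", "affordable", "agents", "aids", "architecture",
                "based", "blog", "board");

        System.out.println("only in first column:  " + onlyIn(firstColumn, secondColumn));
        System.out.println("only in second column: " + onlyIn(secondColumn, firstColumn));
    }
}
```

Because the excerpt is truncated, a difference computed on partial lists is only meaningful when both columns cover the same alphabetical range.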

My Question(s)...
1) Can anybody explain why the set of terms used for the MLT query
differs between the 16-document index and the 4-document index, even
though the query is based on the same source document?

2) Am I right in assuming that the reason for the slightly different
scores is the IDF, or could it be this slight difference in the sets of
terms used (or possibly both)?

regards,
Peter


-- 
with best regards,
Peter Lavin,
PhD Candidate,
CAG - Computer Architecture & Grid Research Group,
Lloyd Institute, 005,
Trinity College Dublin, Ireland.
+353 1 8961536



"about","about"
"affordable","affordable"
"agents","agents"
"aids","aids"
"architecture","architecture"
"arv","based"
"based","blog"
"blog","board"
"board","business"
"business","care"
"care","commemorates"
"commemorates","contacts"
"contacts","contributions"
"contributions","coordinating"
"coordinating","core"
"core","countries"
"countries","country"
"country","data"
"data","decisions"
"decisions","details"
"details","disbursements"
"disbursements","documents"
"documents","donors"

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org 

