lucene-java-user mailing list archives

From Soeren Pekrul <soeren.pek...@gmx.de>
Subject Re: IDFrequency
Date Fri, 02 Feb 2007 22:21:23 GMT
DECAFFMEYER MATHIEU wrote:
> The score depends on
> 1. the query
> 2. the matched document
> 3. the index.
> 
> I don't really understand why the index must influence the score (why it 
> has been implemented that way).

The score should be the similarity (inverse distance) between the query 
and the matched document. How similar is the found document to my query? 
How likely is the found document relevant for my question (query)?

If your query consists of just one word (term), the idf has no influence 
on the ranking, because it scales the score of every matching document 
by the same factor. If the query consists of multiple terms, it can be 
useful to weight the terms. The idea is as follows:

1. Indexing view
The task is to find important words in a document, to find the keywords 
describing this document.
A term that occurs in just one document identifies that document. This 
term seems to be very important for that document. It could be a good 
keyword candidate.
If a term occurs in all documents (like stop words), it can't describe a 
document, because it does not distinguish that document from the others.

2. Query view
A term that occurs in just one document identifies that document. Your 
query will return exactly that document, a perfect result. No ranking is 
necessary.
If you are searching for a term that occurs in all documents, you will 
retrieve the complete collection. You have no selection, no 
sub-collection. You are in the same situation as before your query. This 
term is no real help in finding an answer to a question. The weight of 
this term could be 0 or very small.
If a term has a small document frequency, its weight is high, and if it 
has a large document frequency, it has a lower weight.
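
To make the weighting concrete, here is a minimal sketch in Python. It 
assumes the textbook formula idf = log(N / df); the function name and 
the collection size are made up for illustration. Lucene's own 
Similarity uses a smoothed variant, so the absolute numbers will differ, 
but the behaviour is the same: rare terms get a high weight, and a term 
that occurs in every document gets a weight of 0.

```python
import math


def idf(doc_freq, num_docs):
    """Textbook inverse document frequency: log(N / df).

    doc_freq: number of documents that contain the term
    num_docs: total number of documents in the collection
    """
    if doc_freq <= 0 or doc_freq > num_docs:
        raise ValueError("doc_freq must be between 1 and num_docs")
    return math.log(num_docs / doc_freq)


# Hypothetical collection of 1000 documents.
num_docs = 1000
for df in (1, 10, 1000):
    # The weight shrinks as the document frequency grows.
    print(df, round(idf(df, num_docs), 3))
```

The last case (df equal to the collection size) is the stop word 
situation described above: the term selects everything, so it 
contributes nothing to the ranking.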

Many experiments have shown that score = tf*idf is a fairly good ranking 
method. It is not the best in every case, but it works reasonably well 
in general. You can use it or not; it depends on your requirements.
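
As a rough illustration of score = tf*idf for a query with several 
terms, the sketch below sums tf*idf over the query terms for each 
document. The three documents, their names and the helper function are 
invented for this example, the tokenizer is a plain whitespace split, 
and idf is again the textbook log(N / df). This is not Lucene's actual 
scoring formula, which adds normalization and other factors.

```python
import math
from collections import Counter


def tf_idf_scores(query_terms, docs):
    """Score each document as the sum over query terms of tf * idf."""
    num_docs = len(docs)
    # Term counts per document (naive whitespace tokenization).
    counts = {name: Counter(text.lower().split()) for name, text in docs.items()}
    scores = {}
    for name, tf in counts.items():
        score = 0.0
        for term in query_terms:
            # Document frequency: how many documents contain the term.
            df = sum(1 for c in counts.values() if term in c)
            if df == 0:
                continue  # term is not in the collection at all
            score += tf[term] * math.log(num_docs / df)
        scores[name] = score
    return scores


# Made-up toy collection.
docs = {
    "a.htm": "logistics experience and logistics planning",
    "b.htm": "experience in sales",
    "c.htm": "experience in support",
}
scores = tf_idf_scores(["logistics", "experience"], docs)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked[0], scores)
```

In this toy collection "experience" occurs in every document, so its 
idf is 0 and it does not affect the order. Only "logistics", which 
occurs in a single document, separates a.htm from the rest.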

> Let's say I have this page Logistics.htm
> I have just one time the word "experience" in it.
> It will get a high score because of the IDF but it occurs only once in 
> my document.

Did you really mean the IDF? That looks to me like TF (term frequency), 
i.e. how often a term occurs in a document. The IDF (inverse document 
frequency) is derived from the number of documents in the collection 
that contain the term.
The idea of tf is that, once the stop words have been removed, a term 
that occurs often in a document is more important for that document 
than a term that occurs rarely.

Sören
