lucene-dev mailing list archives

From Bernhard Messer <Bernhard.Mes...@intrafind.de>
Subject Re: Normalization of Documents
Date Sat, 13 Apr 2002 13:05:50 GMT
Hi,

the topic you are focusing on is a never-ending story in content 
retrieval in general. There is no perfect solution which fits every 
environment. Retrieving a document's context based on a single query 
term also seems to be very difficult. In Lucene, however, it isn't very 
difficult to change the ranking algorithm. If you don't like the field 
normalization, you could comment out the following line in the TermScorer 
class.

score *= Similarity.norm(norms[d]);

If you put a comment around this line, your scoring is based on term 
frequency alone.
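
If you'd rather not patch TermScorer itself, a less invasive route is to 
neutralize the length norm. This is only a sketch, assuming a Lucene 
version where Similarity is subclassable (e.g. through DefaultSimilarity, 
which later releases provide):

    import org.apache.lucene.search.DefaultSimilarity;

    // Sketch: returning a constant length norm removes the bias towards
    // short documents without editing TermScorer.
    public class NoLengthNormSimilarity extends DefaultSimilarity {
        public float lengthNorm(String fieldName, int numTerms) {
            return 1.0f;  // treat short and long fields alike
        }
    }

Since the norm is computed at index time, you would have to install this 
Similarity on the IndexWriter and re-index, and set it on the 
IndexSearcher as well.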

If more people are interested, we could think about a somewhat more 
flexible ranking system within Lucene. There are several parameters from 
the environment which could be used to rank a document. For that we would 
need an interface to change the Lucene document boost factor at runtime. 
For example, a document's ranking could be based on (a sketch of how such 
signals might be combined follows the list):
    - links pointing to that document (like Google)
    - last modification date
    - size of the document
    - term frequency
    - how often it was displayed to other users sending the same query 
      terms to the system
    - .....
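
Here is a rough sketch of how such signals could be folded into a single 
index-time boost. The inLinkCount and ageInDays inputs are hypothetical, 
and the Document.setBoost() call assumes the kind of boost interface 
Melissa asks about below (later Lucene releases added one):

    import java.io.IOException;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    public class BoostedIndexer {
        // Hypothetical: combine external ranking signals into one
        // per-document boost at indexing time.
        static void addDocument(IndexWriter writer, String text,
                                int inLinkCount, int ageInDays) throws IOException {
            Document doc = new Document();
            doc.add(Field.Text("contents", text));
            float linkBoost = (float) Math.log(2 + inLinkCount);      // in-links, Google-style
            float recencyBoost = 1.0f / (1.0f + ageInDays / 365.0f);  // favor recent documents
            doc.setBoost(linkBoost * recencyBoost);
            writer.addDocument(doc);
        }
    }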

Let me know if you find that idea interesting; I would like to work on 
that topic.

--Bernhard



Peter Carlson wrote:

>I have noticed the same issue.
>
>From what I understand, this is both the way it should work and a problem.
>Shorter documents which contain a given term should be more relevant, because
>more of the document is about that term (i.e. the term takes up a greater % of
>the document). However, when there are documents of completely different
>sizes (e.g. 20 words vs. 2000 words), this assumption doesn't hold up very
>well.
>
>One solution I've heard is to extract the concepts of the documents, then
>search on those. The concepts are still difficult to extract if the document
>is too short, but it may provide a way to standardize documents. I have been
>lazily looking for an open source, academic concept extractor, but I haven't
>found one. There are companies like Semio and ActiveNavigation which provide
>this service for a fee.
>
>Let me know if you find anything or have other ideas.
>
>--Peter
>
>
>On 4/9/02 10:07 PM, "Melissa Mifsud" <melissamifsud@yahoo.com> wrote:
>
>>Hi,
>>
>>Documents which are shorter in length always seem to score higher in Lucene. I
>>was under the impression that the normalization factors in the scoring
>>function used by Lucene would improve this, however, after a couple of
>>experiments, the short documents still always score the highest.
>>
>>Does anyone have any ideas as to how it is possible to make lengthier
>>documents score higher?
>>
>>Also, I would like a way to boost documents according to the amount of
>>in-links this document has.
>>
>>Has anyone implemented a type of Document.setBoost() method?
>>
>>I found a thread in the lucene-dev mailinglist where Doug Cutting mentions
>>that this would be a great feature to add to Lucene. No one followed up
>>on his email.
>>
>>Melissa.
>>
>
>
>




