lucene-dev mailing list archives

From "Eric D. Friedman" <e...@conveysoftware.com>
Subject Re: Normalization of Documents
Date Sat, 13 Apr 2002 17:12:24 GMT
Bernhard,

I think your idea is very interesting and would be happy to help out.

Eric

On Sat, 13 Apr 2002, Bernhard Messer wrote:

> Hi,
>
> the topic you are focusing on is a never-ending story in content
> retrieval in general. There is no perfect solution that fits every
> environment. Retrieving a document's context based on a single query
> term also seems to be very difficult. In Lucene it isn't very
> difficult to change the ranking algorithm. If you don't like the field
> normalization, you could comment out the following line in the TermScorer
> class.
>
> score *= Similarity.norm(norms[d]);
>
> If you comment out this line, your scoring is based on the
> term frequency, without the field length normalization.
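
A minimal sketch of what removing that factor does to the ranking. It does not use Lucene itself: the class NormToggleSketch, the square-root term-frequency component, and the 1/sqrt(length) norm are all assumptions chosen for illustration, and the real TermScorer also applies other weights that are left out here.

```java
/** Toy scorer; the formulas are illustrative assumptions, not Lucene's exact code. */
final class NormToggleSketch {

    /** Assumed length norm: shrinks as the field gets longer. */
    static double lengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    /** Score for one term; applyNorm=false mimics commenting out the norm line. */
    static double score(int termFreq, int numTerms, boolean applyNorm) {
        double score = Math.sqrt(termFreq); // assumed term-frequency component
        if (applyNorm) {
            score *= lengthNorm(numTerms);
        }
        return score;
    }

    public static void main(String[] args) {
        // Short document: 1 hit in 20 terms. Long document: 9 hits in 2000 terms.
        System.out.println("with norm:    short=" + score(1, 20, true)
                + " long=" + score(9, 2000, true));
        System.out.println("without norm: short=" + score(1, 20, false)
                + " long=" + score(9, 2000, false));
    }
}
```

Under these assumptions the short document ranks first while the norm is applied, and the long document ranks first once it is removed, which is the behaviour change described above.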
>
> If more people are interested, we could think about a somewhat more
> flexible ranking system within Lucene. There would be several parameters
> from the environment which could be used to rank a document.
> Therefore we would need an interface where we could change the Lucene
> document boost factor at runtime. For example, a document's ranking
> could be based on:
>     links pointing to that document (like Google),
>     last modification date,
>     size of the document,
>     term frequency,
>     how often it was displayed to other users sending the same query
> terms to the system
>     .....
>
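
One way such a pluggable boost could look, as a rough sketch only. Nothing here is an existing Lucene API: DocSignals, BoostFactor, RankingSketch, and the two example formulas (a logarithmic in-link boost and a one-year recency decay) are hypothetical names and choices made up for illustration.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical per-document signals gathered from the environment. */
final class DocSignals {
    final int inLinks;
    final long lastModifiedMillis;

    DocSignals(int inLinks, long lastModifiedMillis) {
        this.inLinks = inLinks;
        this.lastModifiedMillis = lastModifiedMillis;
    }
}

/** Hypothetical plug-in point: each factor maps a document to a multiplier. */
interface BoostFactor {
    double boost(DocSignals doc, long nowMillis);
}

final class RankingSketch {

    /** More in-links give a larger boost, damped logarithmically (assumed formula). */
    static final BoostFactor IN_LINKS =
            (doc, now) -> 1.0 + Math.log1p(doc.inLinks);

    /** Recently modified documents keep a boost near 1.0; it halves after one year (assumed). */
    static final BoostFactor RECENCY = (doc, now) -> {
        double ageDays = Math.max(0L, now - doc.lastModifiedMillis) / 86_400_000.0;
        return 1.0 / (1.0 + ageDays / 365.0);
    };

    /** Multiplies all configured factors into one document boost. */
    static double combinedBoost(DocSignals doc, long nowMillis, List<BoostFactor> factors) {
        double boost = 1.0;
        for (BoostFactor factor : factors) {
            boost *= factor.boost(doc, nowMillis);
        }
        return boost;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        DocSignals popular = new DocSignals(50, now);
        DocSignals obscure = new DocSignals(0, now);
        List<BoostFactor> factors = Arrays.asList(IN_LINKS, RECENCY);
        System.out.println("popular: " + combinedBoost(popular, now, factors));
        System.out.println("obscure: " + combinedBoost(obscure, now, factors));
    }
}
```

Multiplying the factors keeps each one independent, so a new signal (document size, click counts for a query) could be added as another BoostFactor without touching the others. How the resulting value would be fed into Lucene's scoring is left open here.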
> Let me know if you find that idea interesting; I would like to work on
> that topic.
>
> --Bernhard
>
>
>
> Peter Carlson wrote:
>
> >I have noticed the same issue.
> >
> >From what I understand, this is both the way it should work and a problem.
> >Shorter documents which have a given term should be more relevant because
> >more of the document is about that term (i.e., the term takes a greater % of
> >the document). However, when there are documents of completely different
> >sizes (e.g., 20 words vs. 2000 words) this assumption doesn't hold up very
> >well.
> >
> >One solution I've heard is to extract the concepts of the documents, then
> >search on those. The concepts are still difficult to extract if the document
> >is too short, but it may provide a way to standardize documents. I have been
> >lazily looking for an open source, academic concept extractor, but I haven't
> >found one. There are companies like Semio and ActiveNavigation which provide
> >this service for a substantial fee.
> >
> >Let me know if you find anything or have other ideas.
> >
> >--Peter
> >
> >
> >On 4/9/02 10:07 PM, "Melissa Mifsud" <melissamifsud@yahoo.com> wrote:
> >
> >>Hi,
> >>
> >>Documents which are shorter in length always seem to score higher in Lucene. I
> >>was under the impression that the normalization factors in the scoring
> >>function used by Lucene would improve this; however, after a couple of
> >>experiments, the short documents still always score the highest.
> >>
> >>Does anyone have any ideas as to how it is possible to make lengthier
> >>documents score higher?
> >>
> >>Also, I would like a way to boost documents according to the number of
> >>in-links each document has.
> >>
> >>Has anyone implemented a type of Document.setBoost() method?
> >>
> >>I found a thread in the lucene-dev mailing list where Doug Cutting mentions
> >>that this would be a great feature to add to Lucene. No one followed up on
> >>his email.
> >>
> >>Melissa.
> >>
> >
> >
> >--
> >To unsubscribe, e-mail:   <mailto:lucene-dev-unsubscribe@jakarta.apache.org>
> >For additional commands, e-mail: <mailto:lucene-dev-help@jakarta.apache.org>
> >
> >

