I have a related problem: I am trying to use Lucene as a duplicate-checking
mechanism. I've found that it is very difficult to get Lucene to give a
decent probability of duplication because of this specific type of
length-based weighting.
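
To make the weighting concrete, here is a rough sketch of what I believe is going on. It assumes the term-frequency factor grows as sqrt(freq) and the length factor is 1/sqrt(numTerms); the class and method names are made up for illustration and are not Lucene API, and the real scorer includes other factors (idf, coord, boosts) that are left out here.

```java
public class LengthNormSketch {
    // Assumed term-frequency factor: grows with the square root of the count.
    public static double tf(int freq) {
        return Math.sqrt(freq);
    }

    // Assumed length factor: shrinks with the square root of the field length.
    public static double lengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    // Simplified per-term weight; idf, coord and boosts are deliberately omitted.
    public static double weight(int freq, int numTerms) {
        return tf(freq) * lengthNorm(numTerms);
    }

    public static void main(String[] args) {
        // A 20-word document that mentions the term once...
        double shortDoc = weight(1, 20);
        // ...versus a 2000-word document that mentions it ten times.
        double longDoc = weight(10, 2000);
        System.out.println("short doc weight: " + shortDoc);
        System.out.println("long doc weight:  " + longDoc);
        System.out.println("ratio short/long: " + (shortDoc / longDoc));
    }
}
```

Under these assumptions the 20-word document comes out roughly three times ahead of the 2000-word one, even though the longer document contains the term ten times. That is the kind of skew that makes the raw score hard to read as a duplication probability.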
Scott
> -----Original Message-----
> From: Halácsy Péter [mailto:halacsy.peter@axelero.com]
> Sent: Thursday, April 11, 2002 8:52 AM
> To: Lucene Developers List
> Subject: RE: Normalization of Documents
>
>
> Extracting concepts is not easy, and I don't think you can
> implement a solution that is independent of language, context,
> and document type. Filtering only the important terms of a
> text (rather than indexing all of the text, as modern full-text
> indexing systems do) is one of the most important areas of IR.
> A lot of projects have worked on this topic, but nowadays it is
> less important because we can index every term if we want
> (cheaper and faster disks, plenty of CPU).
>
> I think that in Lucene the term's % of the document
> (NUMBER_OF_QUERY_TERM_OCCURRENCES /
> NUMBER_OF_WORDS_IN_THE_DOCUMENT) is overweighted in some cases.
> I would like to tune it if I could.
>
> Document scoring could provide a solution for me, and I think
> for Melissa as well. I think it's a very important feature of
> a modern IR system. For example, Melissa would use it to score
> documents based on link popularity (or impact
> factor/citation frequency). In my project I need to score
> documents on their length and their age (a more recent document
> is more valuable, and in my archive very old documents are as
> valuable as very new ones).
>
> peter
>
> > -----Original Message-----
> > From: Peter Carlson [mailto:carlson@bookandhammer.com]
> > Sent: Wednesday, April 10, 2002 5:17 PM
> > To: Lucene Developers List
> > Subject: Re: Normalization of Documents
> >
> >
> > I have noticed the same issue.
> >
> > From what I understand, this is both the way it should work
> > and a problem. Shorter documents that contain a given term
> > should be more relevant because more of the document is about
> > that term (i.e., the term takes up a greater % of the
> > document). However, when documents are of completely different
> > sizes (e.g., 20 words vs. 2000 words), this assumption doesn't
> > hold up very well.
> >
> > One solution I've heard of is to extract the concepts of the
> > documents, then search on those. The concepts are still
> > difficult to extract if the document is too short, but it may
> > provide a way to standardize documents. I have been lazily
> > looking for an open source, academic concept extractor, but I
> > haven't found one. There are companies like Semio and
> > ActiveNavigation which provide this service for an expensive
> > fee.
> >
> > Let me know if you find anything or have other ideas.
> >
> > --Peter
> >
> >
> > On 4/9/02 10:07 PM, "Melissa Mifsud" <melissamifsud@yahoo.com> wrote:
> >
> > > Hi,
> > >
> > > Documents which are shorter in length always seem to score
> > > higher in Lucene. I was under the impression that the
> > > normalization factors in the scoring function used by Lucene
> > > would improve this; however, after a couple of experiments,
> > > the short documents still always score the highest.
> > >
> > > Does anyone have any ideas as to how it is possible to make
> > > lengthier documents score higher?
> > >
> > > Also, I would like a way to boost documents according to
> > > the number of in-links each document has.
> > >
> > > Has anyone implemented a type of Document.setBoost() method?
> > >
> > > I found a thread in the lucene-dev mailing list where Doug
> > > Cutting mentions that this would be a great feature to add to
> > > Lucene. No one followed up on his email.
> > >
> > > Melissa.
> > >
> >
> >
> > --
> > To unsubscribe, e-mail:
> > <mailto:lucene-dev-unsubscribe@jakarta.apache.org>
> > For additional commands, e-mail:
> > <mailto:lucene-dev-help@jakarta.apache.org>
> >
> >
>