lucene-dev mailing list archives

From Scott Ganyo <scott.ga...@eTapestry.com>
Subject RE: Normalization of Documents
Date Thu, 11 Apr 2002 14:18:39 GMT
I have a related problem where I attempt to use Lucene as a duplicate
checking mechanism.  I've found that it is very difficult to get Lucene to
give a decent probability of duplication because of this specific type of
weighting that is done.

Scott
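
A relevance score that favors short documents is hard to read as a likelihood of duplication. The sketch below is not Lucene code. It swaps in a different technique, plain cosine similarity over raw term counts, which is symmetric and close to 1.0 for identical texts. The function names and the whitespace tokenization are assumptions made for illustration.

```python
import math
from collections import Counter


def term_counts(text):
    # Naive lowercase + whitespace tokenization (an assumption, not Lucene's analyzer).
    return Counter(text.lower().split())


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the two texts' raw term-count vectors."""
    a, b = term_counts(text_a), term_counts(text_b)
    # Only terms present in both texts contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_a * norm_b)
```

Because both vectors are normalized by their own length, a long document and a short one are compared on the same footing, which is the property the duplicate check needs.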

> -----Original Message-----
> From: Halácsy Péter [mailto:halacsy.peter@axelero.com]
> Sent: Thursday, April 11, 2002 8:52 AM
> To: Lucene Developers List
> Subject: RE: Normalization of Documents
> 
> 
> Extracting concepts is not an easy thing, and I don't think
> you can implement a solution that is independent of language,
> context, and document type. Filtering only the important terms
> of a text (rather than indexing all of the text, as modern
> full-text indexing systems do) is one of the most important
> areas of IR. A lot of projects worked on this topic, but
> nowadays it's less important because we can index every term
> if we want (cheaper and faster disks, lots of CPU).
> 
> I think that in Lucene the term's % of the document
> (NUMBER_OF_QUERY_TERM_OCCURRENCES /
> NUMBER_OF_WORDS_IN_THE_DOCUMENT) is overweighted in some cases.
> I would like to tune it if I could.
> 
> Document scoring could provide a solution for me and, I think,
> for Melissa as well. I think it's a very important feature of
> a modern IR system. For example, Melissa would use it to score
> documents based on link popularity (or impact
> factor/citation frequency). In my project I need to score
> documents on their length and their age (more recent documents
> are more valuable, and very old documents are as valuable as
> very new ones in my archive).
> 
> peter
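
The message above does not give a formula for document-level scoring. As a minimal sketch, one could multiply the query relevance by query-independent weights for length and age. Everything here is hypothetical: the function names, the `pivot_words`, `midlife_days`, and `floor` parameters, the logarithmic length curve, and the U-shaped age curve (chosen so that very new and very old documents both get full weight).

```python
import math


def length_weight(num_words, pivot_words=500):
    # Hypothetical: grows slowly with length and equals 1.0 at pivot_words.
    return math.log(1 + num_words) / math.log(1 + pivot_words)


def age_weight(age_days, midlife_days=3650.0, floor=0.5):
    # Hypothetical U-shape: 1.0 for brand-new and for very old documents,
    # dipping to `floor` when the document is midlife_days old.
    distance = min(abs(age_days - midlife_days) / midlife_days, 1.0)
    return floor + (1.0 - floor) * distance


def document_score(relevance, num_words, age_days):
    # Query relevance scaled by the two query-independent weights.
    return relevance * length_weight(num_words) * age_weight(age_days)
```

Since the weights do not depend on the query, they could be computed once per document at indexing time and stored alongside it.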
> 
> > -----Original Message-----
> > From: Peter Carlson [mailto:carlson@bookandhammer.com]
> > Sent: Wednesday, April 10, 2002 5:17 PM
> > To: Lucene Developers List
> > Subject: Re: Normalization of Documents
> > 
> > 
> > I have noticed the same issue.
> > 
> > From what I understand, this is both the way it should work
> > and a problem. Shorter documents that contain a given term should
> > be more relevant because more of the document is about that term
> > (i.e., the term takes up a greater % of the document). However,
> > when documents are of completely different sizes (e.g., 20 words
> > vs. 2000 words), this assumption doesn't hold up very well.
> > 
> > One solution I've heard of is to extract the concepts of the
> > documents, then search on those. The concepts are still difficult
> > to extract if the document is too short, but it may provide a way
> > to standardize documents. I have been lazily looking for an open
> > source, academic concept extractor, but I haven't found one. There
> > are companies like Semio and ActiveNavigation which provide this
> > service for an expensive fee.
> > 
> > Let me know if you find anything or have other ideas.
> > 
> > --Peter
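
The 20-word vs. 2000-word case can be made concrete with a small calculation. This sketch assumes a term-frequency factor of sqrt(freq) and a length factor of 1/sqrt(number of terms), which to my knowledge matches Lucene's classic default Similarity, but treat both as assumptions about the code base discussed here. It ignores idf and query normalization, which are the same for every document when a single term is queried. The function names are illustrative.

```python
import math


def tf(freq):
    # Assumed term-frequency factor: square root of the raw count.
    return math.sqrt(freq)


def length_norm(num_terms):
    # Assumed length factor: shorter fields get a larger multiplier.
    return 1.0 / math.sqrt(num_terms)


def field_score(freq, num_terms):
    # Per-document part of the score for a single query term.
    return tf(freq) * length_norm(num_terms)


short_doc = field_score(freq=1, num_terms=20)     # one hit in 20 words
long_doc = field_score(freq=10, num_terms=2000)   # ten hits in 2000 words
print(short_doc > long_doc)
```

Under these assumptions the score reduces to sqrt(freq / num_terms), so the 2000-word document would need 100 occurrences of the term to tie a 20-word document that mentions it once.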
> > 
> > 
> > On 4/9/02 10:07 PM, "Melissa Mifsud" <melissamifsud@yahoo.com> wrote:
> > 
> > > Hi,
> > > 
> > > Documents which are shorter in length always seem to score higher
> > > in Lucene. I was under the impression that the normalization
> > > factors in the scoring function used by Lucene would improve
> > > this; however, after a couple of experiments, the short documents
> > > still always score the highest.
> > > 
> > > Does anyone have any ideas as to how it is possible to make
> > > lengthier documents score higher?
> > > 
> > > Also, I would like a way to boost documents according to the
> > > number of in-links each document has.
> > > 
> > > Has anyone implemented a type of Document.setBoost() method?
> > > 
> > > I found a thread in the lucene-dev mailing list where Doug Cutting
> > > mentions that this would be a great feature to add to Lucene. No
> > > one followed up on his email.
> > > 
> > > Melissa.
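
The thread does not say how a per-document boost would work. As a rough sketch of the idea, a boost derived from the in-link count could be stored on the document and multiplied into its relevance score. The `Document` class below is a toy stand-in, not Lucene's class. `set_boost` only echoes the `Document.setBoost()` name from the message, and the logarithmic `inlink_boost` formula is an assumption.

```python
import math
from dataclasses import dataclass


@dataclass
class Document:
    # Toy document: the text plus a multiplicative boost (1.0 means neutral).
    text: str
    boost: float = 1.0

    def set_boost(self, boost):
        if boost <= 0:
            raise ValueError("boost must be positive")
        self.boost = boost


def inlink_boost(inlinks):
    # Hypothetical formula: 1.0 with no in-links, growing logarithmically
    # so that heavily linked pages do not swamp the text relevance.
    return 1.0 + math.log1p(inlinks)


def boosted_score(relevance, doc):
    # Final score is the query relevance scaled by the document's boost.
    return relevance * doc.boost
```

A caller would set the boost once when the document is added, for example `doc.set_boost(inlink_boost(12))`, and the same factor would then apply to every query that matches the document.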
> > > 
> > 
> > 
> > --
> > To unsubscribe, e-mail:   
> > <mailto:lucene-dev-unsubscribe@jakarta.apache.org>
> > For additional commands, e-mail: 
> > <mailto:lucene-dev-help@jakarta.apache.org>
> > 
> > 
> 
