lucene-dev mailing list archives

From Halácsy Péter <halacsy.pe...@axelero.com>
Subject RE: Normalization of Documents
Date Thu, 11 Apr 2002 13:51:57 GMT
Extracting concepts is not an easy task, and I don't think you can implement a language-, context-, or document-type-independent solution. Filtering out only the important terms of a text (rather than indexing all of the text, as modern full-text indexing systems do) is one of the most important areas of IR. Many projects have worked on this topic, but nowadays it is less critical because we can index every term if we want to (disks are cheaper and faster, and there is plenty of CPU).

I think that in Lucene the term's share of the document (NUMBER_OF_QUERY_TERM_OCCURRENCES / NUMBER_OF_WORDS_IN_THE_DOCUMENT) is overweighted in some cases. I would like to tune it if I could.
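For what it's worth, Lucene's classic length normalization is 1/sqrt(numTerms). A rough Python sketch (not Lucene code; the damped variant and its exponent are illustrative assumptions) of how flattening that curve shrinks the short-document advantage for the 20-word vs. 2000-word case discussed below:

```python
import math

def classic_length_norm(num_terms: int) -> float:
    # Lucene's classic lengthNorm: 1 / sqrt(number of terms in the field).
    return 1.0 / math.sqrt(num_terms)

def damped_length_norm(num_terms: int, exponent: float = 0.25) -> float:
    # Hypothetical flatter alternative: a smaller exponent shrinks the
    # gap between short and long documents. Purely an illustration.
    return 1.0 / (num_terms ** exponent)

short_doc, long_doc = 20, 2000
print(classic_length_norm(short_doc) / classic_length_norm(long_doc))  # ~10x advantage for the short doc
print(damped_length_norm(short_doc) / damped_length_norm(long_doc))    # ~3.2x advantage
```

With the classic norm a 20-word document gets a 10x head start over a 2000-word one on this factor alone, which is the overweighting complained about above.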

Document scoring could provide a solution for me, and I think for Melissa as well. I think it is a very important feature of a modern IR system. For example, Melissa could use it to score documents based on link popularity (or impact factor / citation frequency). In my project I need to score documents by their length and their age (a more recent document is more valuable, and in my archive very old documents are as valuable as very new ones).
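A minimal sketch of the kind of static, per-document boost described above, in plain Python. The factor shapes and names here are assumptions for illustration (log-damped in-links, a U-shaped age curve so that very new and very old documents both get full weight), not an existing Lucene API:

```python
import math

def doc_boost(in_links: int, age_years: float, max_age: float = 50.0) -> float:
    # Popularity factor: log-damped in-link count (assumed form).
    popularity = 1.0 + math.log1p(in_links)
    # Age factor: U-shaped, so very new and very old documents both get
    # full weight, while middle-aged ones dip to half weight.
    t = min(age_years, max_age) / max_age      # 0.0 = brand new, 1.0 = very old
    freshness = 1.0 - 0.5 * (4 * t * (1 - t))  # 1.0 at the ends, 0.5 at mid-age
    return popularity * freshness

def final_score(text_score: float, boost: float) -> float:
    # The query-time text score would simply be multiplied by the
    # precomputed static boost of the matching document.
    return text_score * boost
```

Such a boost could be computed at index time and stored per document, which is essentially what a Document.setBoost() method would enable.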

peter

> -----Original Message-----
> From: Peter Carlson [mailto:carlson@bookandhammer.com]
> Sent: Wednesday, April 10, 2002 5:17 PM
> To: Lucene Developers List
> Subject: Re: Normalization of Documents
> 
> 
> I have noticed the same issue.
> 
> From what I understand, this is both the way it should work 
> and a problem.
> Shorter documents which have a given term, should be more 
> relevant because
> more of the document is about that term (i.e. the term takes a 
> greater % of
> the document). However, when there are documents of 
> completely different
> sizes (i.e. 20 words vs. 2000 words) this assumption doesn't 
> hold up very
> well.
> 
> One solution I've heard is to extract the concepts of the 
> documents, then
> search on those. The concepts are still difficult to extract 
> if the document
> is too short, but it may provide a way to standardize 
> documents. I have been
> lazily looking for an open source, academic concept 
> extractor, but I haven't
> found one. There are companies like Semio and 
> ActiveNavigation which provide
> this service for a fee.
> 
> Let me know if you find anything or have other ideas.
> 
> --Peter
> 
> 
> On 4/9/02 10:07 PM, "Melissa Mifsud" <melissamifsud@yahoo.com> wrote:
> 
> > Hi,
> > 
> > Documents which are shorter in length always seem to score 
> higher in Lucene. I
> > was under the impression that the normalization factors in 
> the scoring
> > function used by Lucene would improve this, however, after 
> a couple of
> > experiments, the short documents still always score the highest.
> > 
> > Does anyone have any ideas as to how it is possible to make 
> lengthier
> > documents score higher?
> > 
> > Also, I would like a way to boost documents according to 
> the amount of
> > in-links this document has.
> > 
> > Has anyone implemented a type of Document.setBoost() method?
> > 
> > I found a thread in the lucene-dev mailinglist where Doug 
> Cutting mentions
> > that this would be a great feature to add to Lucene. No one 
> followed his
> > email.
> > 
> > Melissa.
> > 
> 
> 
> --
> To unsubscribe, e-mail:   
> <mailto:lucene-dev-unsubscribe@jakarta.apache.org>
> For additional commands, e-mail: 
> <mailto:lucene-dev-help@jakarta.apache.org>
> 
> 
