Date: Thu, 11 Apr 2002 07:35:07 -0700
Subject: Re: Normalization of Documents
From: Peter Carlson
To: Lucene Users List

Hi,

These types of questions/discussions should be on the users list, not the
dev list, please.

Just for the record, Lucene's scoring is not as simple as a percentage.
From the FAQ, Lucene's scoring algorithm is, roughly:

  score_d = sum_t( tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t )

where:

  score_d   : score for document d
  sum_t     : sum over all terms t
  tf_q      : the square root of the frequency of t in the query
  tf_d      : the square root of the frequency of t in d
  idf_t     : log(numDocs/(docFreq_t+1)) + 1.0
  numDocs   : number of documents in the index
  docFreq_t : number of documents containing t
  norm_q    : sqrt(sum_t((tf_q*idf_t)^2))
  norm_d_t  : square root of the number of tokens in d in the same field as t

(I hope that's right!)

[Doug later added...] Make that:

  score_d = sum_t( tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t * boost_t ) * coord_q_d

where:

  boost_t   : the user-specified boost for term t
  coord_q_d : number of terms in both query and document / number of terms
              in the query

The coordination factor gives an AND-like boost to documents that contain,
e.g., all three terms of a three-word query over those that contain just
two of the words.

Although this may still not be what you want, you should be able to replace
the scoring mechanism with your own. The problem you might run into is that
fetching document data (such as the date) for each hit will slow down your
search speed dramatically.

Do you know of any solutions (academic or free) that provide this concept
extraction? I've heard of a group in the UK who worked on something like
this.

--Peter
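P.S. To make the formula concrete, here is a rough, untested sketch in plain
Java of the arithmetic above. It is not Lucene's actual code, and all of the
statistics for the two-term query and the single document are made up:

    // Rough sketch of the scoring arithmetic from the FAQ formula above.
    // All numbers (frequencies, docFreq, numDocs, tokens, boosts) are
    // invented for illustration; this is not Lucene's implementation.
    public class ScoreSketch {

        static double tf(int freq) { return Math.sqrt(freq); }

        static double idf(int numDocs, int docFreq) {
            return Math.log((double) numDocs / (docFreq + 1)) + 1.0;
        }

        public static void main(String[] args) {
            int numDocs = 1000;                       // documents in the index

            // A two-term query and per-term statistics for one document d.
            String[] terms         = { "lucene", "scoring" };
            int[]    queryFreq     = { 1, 1 };        // frequency of t in the query
            int[]    docFreq       = { 50, 10 };      // documents containing t
            int[]    freqInDoc     = { 3, 0 };        // frequency of t in d
            int[]    tokensInField = { 200, 200 };    // tokens in d's field for t
            double[] boost         = { 1.0, 1.0 };    // user-specified boost_t

            // norm_q = sqrt( sum_t (tf_q * idf_t)^2 )
            double normQ = 0.0;
            for (int i = 0; i < terms.length; i++) {
                double w = tf(queryFreq[i]) * idf(numDocs, docFreq[i]);
                normQ += w * w;
            }
            normQ = Math.sqrt(normQ);

            // score_d = sum_t( tf_q*idf_t/norm_q * tf_d*idf_t/norm_d_t * boost_t )
            //           * coord_q_d
            double score = 0.0;
            int matching = 0;
            for (int i = 0; i < terms.length; i++) {
                if (freqInDoc[i] == 0) continue;      // term t not in document d
                matching++;
                double normDt = Math.sqrt(tokensInField[i]);  // norm_d_t
                score += tf(queryFreq[i]) * idf(numDocs, docFreq[i]) / normQ
                       * tf(freqInDoc[i]) * idf(numDocs, docFreq[i]) / normDt
                       * boost[i];
            }
            double coord = (double) matching / terms.length;  // coord_q_d
            score *= coord;

            System.out.println("score_d = " + score);
        }
    }

You can see from the norm_d_t term why short documents tend to come out on
top: each term's contribution is divided by the square root of the field
length.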
On 4/11/02 6:51 AM, "Halácsy Péter" wrote:

> Extracting concepts is not too easy a thing, and I don't think you can
> implement a language/context/document-type independent solution. Filtering
> only the important terms of a text (and not indexing all text as in modern
> full-text indexing systems) is one of the most important areas of IR. A lot
> of projects worked on this topic, but nowadays it's not too important
> because we can index every term if we want (cheaper and faster disks, lots
> of CPU).
>
> I think in Lucene the term's % of the document
> (NUMBER_OF_WORDS_IN_THE_DOCUMENT / NUMBER_OF_QUERY_TERM_OCCURRENCE) is
> overweighted in some cases. I would like to tune it if I could.
>
> Document scoring could provide a solution for me, and I think for Melissa
> as well. I think it's a very important feature of a modern IR system. For
> example, Melissa would use it to score documents based on link popularity
> (or impact factor/citation frequency). In my project I should score
> documents on their length and their age (a more recent document is more
> valuable, and very old documents are as valuable as very new ones in my
> archive).
>
> peter
>
>> -----Original Message-----
>> From: Peter Carlson [mailto:carlson@bookandhammer.com]
>> Sent: Wednesday, April 10, 2002 5:17 PM
>> To: Lucene Developers List
>> Subject: Re: Normalization of Documents
>>
>> I have noticed the same issue.
>>
>> From what I understand, this is both the way it should work and a
>> problem. Shorter documents which have a given term should be more
>> relevant because more of the document is about that term (i.e., the term
>> takes up a greater % of the document). However, when there are documents
>> of completely different sizes (i.e., 20 words vs. 2000 words) this
>> assumption doesn't hold up very well.
>>
>> One solution I've heard is to extract the concepts of the documents, then
>> search on those. The concepts are still difficult to extract if the
>> document is too short, but it may provide a way to standardize documents.
>> I have been lazily looking for an open source, academic concept
>> extractor, but I haven't found one. There are companies like Semio and
>> ActiveNavigation which provide this service for an expensive fee.
>>
>> Let me know if you find anything or have other ideas.
>>
>> --Peter
>>
>> On 4/9/02 10:07 PM, "Melissa Mifsud" wrote:
>>
>>> Hi,
>>>
>>> Documents which are shorter in length always seem to score higher in
>>> Lucene. I was under the impression that the normalization factors in the
>>> scoring function used by Lucene would improve this; however, after a
>>> couple of experiments, the short documents still always score the
>>> highest.
>>>
>>> Does anyone have any ideas as to how it is possible to make lengthier
>>> documents score higher?
>>>
>>> Also, I would like a way to boost documents according to the number of
>>> in-links a document has.
>>>
>>> Has anyone implemented a type of Document.setBoost() method?
>>>
>>> I found a thread on the lucene-dev mailing list where Doug Cutting
>>> mentions that this would be a great feature to add to Lucene. No one
>>> followed up on his email.
>>>
>>> Melissa.
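Melissa: until something like Document.setBoost() exists, one workaround
that doesn't touch Lucene's internals is to store the per-document factors
(in-link count, date) as fields at index time and then re-rank the top hits
after the search. Below is a rough, untested sketch against the
IndexSearcher/Hits API; the "inlinks" and "ageDays" field names and the
weighting constants are just placeholders, and (as mentioned above) loading
a Document per hit is slow, so only re-rank the first N hits:

    import java.util.Arrays;
    import java.util.Comparator;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;

    // Rough, untested sketch: fold an external, per-document factor
    // (stored in made-up "inlinks" and "ageDays" fields) into the score
    // returned by Lucene, then re-sort the top hits.
    public class RescoreSketch {

        public static class ScoredDoc {
            public final Document doc;
            public final double score;
            ScoredDoc(Document doc, double score) { this.doc = doc; this.score = score; }
        }

        public static ScoredDoc[] searchAndRescore(IndexSearcher searcher, Query query)
                throws Exception {
            Hits hits = searcher.search(query);

            // Only re-rank the top hits; fetching every Document is what
            // kills search speed.
            int n = Math.min(hits.length(), 100);
            ScoredDoc[] out = new ScoredDoc[n];

            for (int i = 0; i < n; i++) {
                Document doc = hits.doc(i);
                double luceneScore = hits.score(i);

                String inlinksValue = doc.get("inlinks");   // made-up field
                String ageValue     = doc.get("ageDays");   // made-up field
                int inlinks = (inlinksValue == null) ? 0 : Integer.parseInt(inlinksValue);
                int ageDays = (ageValue == null)     ? 0 : Integer.parseInt(ageValue);

                // Example weighting: reward in-links, gently penalize age.
                double docBoost = (1.0 + Math.log(1 + inlinks))
                                / (1.0 + ageDays / 365.0);
                out[i] = new ScoredDoc(doc, luceneScore * docBoost);
            }

            // Highest combined score first.
            Arrays.sort(out, new Comparator() {
                public int compare(Object a, Object b) {
                    double diff = ((ScoredDoc) b).score - ((ScoredDoc) a).score;
                    return diff < 0 ? -1 : (diff > 0 ? 1 : 0);
                }
            });
            return out;
        }
    }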