Subject: RE: wrong BM25 implementation in Lucene
Date: Thu, 26 Oct 2006 10:40:23 +0100
From: "J.Zhu"
Reply-To: java-dev@lucene.apache.org

Hi,

The University of Amsterdam provides a downloadable language-modelling version of Lucene. Their language model is not BM25, but it is quite similar in nature. The version is at: http://ilps.science.uva.nl/Resources/#lm-lucen

I have worked on their version a bit. They have created new classes: TermQueryLanguageModel, TermScorerLanguageModel, IndexSearcherLanguageModel, LanguageModelIndexReader, etc. I think their work could be useful to you.

If you have a successful implementation of BM25, would you be happy to share it with us?

Jianhan

-----Original Message-----
From: beatriz ramos [mailto:beatriz.ramos.moreno@gmail.com]
Sent: 25 October 2006 16:01
To: java-dev
Subject: wrong BM25 implementation in Lucene

Hello,

This is the BM25 algorithm I implemented in Lucene. It does not work: I compared my results with the results of MG4J (on the same document set) and they do not match. I don't know whether my formula is wrong or there is another mistake.

Could you help me?
------------------------------------------------------------------------

public class BM25Scorer extends Scorer {

    private final static double EPSILON_SCORE = 1.000000082240371E-9;
    private final static double DEFAULT_K1 = 0.75d;
    private final static double DEFAULT_B = 0.95d;

    private double b = DEFAULT_B;
    private double k1 = DEFAULT_K1;
    private IndexReader reader;
    private Term term;
    private Hits hits;
    private int position; // document position in hits
    private IndexSearcher searcher;
    private int cooc = 0; // how many times a term appears in the document
    private float idf;

    public float score() throws IOException {
        TermFreqVector tfv = reader.getTermFreqVector( hits.id(position), term.field() );
        String[] terms = tfv.getTerms();
        int[] freqs = tfv.getTermFrequencies();
        for (int i = 0 ; i < terms.length ; i++) {
            if( terms[i].equalsIgnoreCase(term.text()) ){
                cooc = freqs[i];
            }
        }

        idf = searcher.getSimilarity().idf(term, searcher);

        Document document = (Document)hits.doc(position);
        String[] values = document.getValues("DOCUMENT_LENGTH"); // document length is a field of my index
        long docLength = Long.valueOf(values[0]).longValue(); // document length (number of words)
        long averageLength = 200;

        double loga = Math.max( EPSILON_SCORE, new Float(idf).doubleValue() );
        double score = ( loga * (k1 + 1) * cooc ) / (cooc + k1*( (1-b) + (b*docLength/averageLength) ) );
        return new Float(score).floatValue();
    }
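
For comparison with the score() method quoted above, here is a minimal standalone sketch of the textbook Okapi BM25 per-term score. It is not taken from Lucene or MG4J: the class name Bm25Sketch, its method names, and the sample numbers in main are hypothetical, and k1 = 1.2, b = 0.75 are only the values commonly quoted in the literature. The term-frequency part has the same shape as the posted formula, so the mismatch may lie in the inputs instead: the posted code takes idf from Lucene's Similarity (which, if the default Similarity is in use, computes log(numDocs / (docFreq + 1)) + 1 rather than the BM25 form below), hard-codes the average length to 200, and uses k1 = 0.75, b = 0.95.

```java
public final class Bm25Sketch {

    /**
     * Robertson/Sparck Jones IDF as used in textbook BM25.
     * Goes negative when the term occurs in more than half of the documents,
     * which is why some implementations clamp it to a small positive value.
     */
    static double idf(long numDocs, long docFreq) {
        return Math.log((numDocs - docFreq + 0.5) / (docFreq + 0.5));
    }

    /**
     * BM25 contribution of one query term to one document.
     *
     * @param tf           occurrences of the term in the document
     * @param docLength    length of the document in words
     * @param avgDocLength average document length over the collection
     * @param idf          inverse document frequency of the term
     * @param k1           term-frequency saturation parameter
     * @param b            length-normalisation parameter in [0, 1]
     */
    static double termScore(int tf, long docLength, double avgDocLength,
                            double idf, double k1, double b) {
        // (1 - b) + b * |d| / avgdl: equals 1 for a document of average length
        double lengthNorm = (1.0 - b) + b * docLength / avgDocLength;
        return idf * (k1 + 1.0) * tf / (tf + k1 * lengthNorm);
    }

    public static void main(String[] args) {
        // Hypothetical collection: 1000 documents, term present in 10 of them.
        double termIdf = idf(1000, 10);
        // Same term frequency, short document versus long document.
        System.out.println(termScore(3, 150, 200.0, termIdf, 1.2, 0.75));
        System.out.println(termScore(3, 600, 200.0, termIdf, 1.2, 0.75));
    }
}
```

Feeding both implementations the same tf, document length, average length, and idf for a single document and term, then comparing the numbers, should show whether the difference comes from the formula or from these inputs.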