Subject: search quality - assessment & improvements
To: java-dev@lucene.apache.org
From: Doron Cohen
Date: Mon, 25 Jun 2007 00:15:30 -0700

hi, this could probably be split into two threads, but for context let's start it as a single discussion.

Recently I was looking at the search quality of Lucene - Recall and Precision, focusing on P@1,5,10,20 and, mainly, MAP.

-- Part 1 --

I found that quality can be improved by modifying the doc length normalization, and by changing the tf() computation to also consider the average tf() in a single document.

For the first change, the logic is that Lucene's default length normalization punishes long documents too much. I found contrib's sweet-spot-similarity helpful here, but not enough. A better doc-length normalization method is one that considers collection statistics - e.g. average doc length. The interesting problem with such an approach is that you don't know the average length at indexing time, and it changes as the index evolves. The static nature of norms computation (and API) in Lucene is, while efficient, an obstacle for global computations.

Another issue here is that applications often split documents into fields for reasons that are not "pure IR". For instance, an application may use a content field and a title field just to be able to boost the title by (say) 3, but in fact there is no "IR'ish" difference between finding the searched text in the title field or in the body field - they really serve/answer the same information need.
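As a rough illustration of normalizing by average document length (a sketch only: the pivoted formula and the slope value are assumptions chosen for illustration, not necessarily the method that was tested, and the function names are hypothetical rather than Lucene API), here is how it compares with the default 1/sqrt(numTerms) norm:

```python
import math


def default_length_norm(num_terms):
    """Default-style norm: 1 / sqrt(number of terms in the field)."""
    return 1.0 / math.sqrt(num_terms)


def pivoted_length_norm(num_terms, avg_num_terms, slope=0.25):
    """Pivoted normalization around the collection's average length.

    `slope` is a hypothetical tuning parameter; a document of exactly
    average length gets a norm of 1.0.
    """
    return 1.0 / ((1.0 - slope) + slope * (num_terms / avg_num_terms))


# Toy collection: the average must be recomputed as the index evolves.
doc_lengths = [100, 400, 1600]
avg_len = sum(doc_lengths) / len(doc_lengths)

for n in doc_lengths:
    print(n, default_length_norm(n), pivoted_length_norm(n, avg_len))
```

With these toy numbers the default norm of the 100-term document is four times that of the 1600-term document, while under the pivoted variant the ratio is below two, i.e. long documents are punished less.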
For that matter, I believe that using a single document length when searching all these fields is more "accurate".

For the second change, the logic is as follows. Assume two documents: doc1 containing 10 "A"'s, 10 "B"'s, and 10 "Z"'s, and doc2 containing one each of "A" to "T" plus 10 "Z"'s. Both doc1 and doc2 are of length 30. Searching for "Z", tf("Z")=10 in both doc1 and doc2. So, currently, doc1 and doc2 score the same for "Z", but the "truth" is that "Z" is much more representative/important in doc2 than it is in doc1, because its frequency in doc2 is 10 times that of every other word in that doc, while in doc1 it is the same as that of the other words in that doc.

If you agree about the potential improvement here, again, an interesting problem is that the current Similarity API does not even allow considering this info (the average term frequency in the specific document), because Similarity.tf(int/float freq) takes only the frequency param. One way to make room for such a computation is to add an "int docid" param to the Similarity API, but then the implementation of that class becomes IndexReader aware.

Both modifications above have, in addition to API implications, also performance implications, mainly for search performance, and I would like to get some feedback on what people think about going in this direction... first the "if", only then the "how"...

-- Part 2 --

It is very important that we are able to assess search quality in a repeatable manner - so that anyone can repeat the quality tests, and maybe find ways to improve them. (This would also allow verifying the "improvement claims" above...). This capability seems like a natural part of the benchmark package. I started to look at extending the benchmark package with a search quality module that would open an index (or first create one), run a set of queries (similar to the performance benchmark), and compute and report the set of known statistics mentioned above and more.
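For reference, the statistics named above (P@k and MAP) can be sketched as follows. This is a minimal illustration using the standard definitions, not code from the benchmark package; the function names and the toy doc ids are hypothetical, and each ranked list is assumed to contain no duplicate doc ids.

```python
def precision_at_k(ranked_doc_ids, relevant, k):
    """Fraction of the top-k results that are judged relevant."""
    top_k = ranked_doc_ids[:k]
    return sum(1 for d in top_k if d in relevant) / k


def average_precision(ranked_doc_ids, relevant):
    """Mean of precision values at the rank of each relevant hit,
    divided by the total number of relevant docs for the query."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_average_precision(runs, judgements):
    """MAP over all queries; `runs` maps query id -> ranked doc ids,
    `judgements` maps query id -> set of relevant doc ids."""
    aps = [average_precision(runs[q], judgements.get(q, set())) for q in runs]
    return sum(aps) / len(aps)


# Toy data: two queries with made-up results and judgements.
runs = {"q1": ["d3", "d1", "d7", "d2"], "q2": ["a", "b"]}
judgements = {"q1": {"d1", "d2", "d9"}, "q2": {"a"}}

print(precision_at_k(runs["q1"], judgements["q1"], 2))
print(mean_average_precision(runs, judgements))
```

Because the computation only needs ranked result lists and judgement sets, the same functions can be rerun by anyone holding the same input data, which is the repeatability the module is meant to provide.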
Such a module depends on input data - documents, queries, and judgements. And that's my second question.

We don't have to invent this data - TREC has it already, and it grows every year as more judgements are added. So, theoretically, we could use TREC data. One problem here is that TREC data must be purchased. I am not sure that this is a problem - it is OK if we provide the mechanism to use this data for those who have it (Universities, for one).

The other problem is that it is not clear to me what one can legally say about a certain system's results on TREC data. I would like the Search Quality Web page of Lucene to say something like: "MAP of XYZ for Track Z of TREC 2004", and then a certain submitted patch to say "I improved to 1.09*XYZ". But would that be legal? I just re-read their "Agreement Concerning Dissemination of TREC Results" - http://trec.nist.gov/act_part/forms/noads.html - and I am not feeling any smarter about this.

-----------

Thoughts?