Date: Fri, 1 Apr 2011 13:17:05 +0000 (UTC)
From: "Robert Muir (JIRA)"
To: dev@lucene.apache.org
Reply-To: dev@lucene.apache.org
Message-ID: <687723250.27962.1301663825913.JavaMail.tomcat@hel.zones.apache.org>
In-Reply-To: <1238437159.10432.1299754259722.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Commented] (LUCENE-2959) [GSoC] Implementing State of the Art Ranking for Lucene

[ https://issues.apache.org/jira/browse/LUCENE-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13014547#comment-13014547 ]

Robert Muir commented on LUCENE-2959:
-------------------------------------

{quote}
One thing that is not clear for me is why these limitations would not be a problem for BM25. As I see it, the difference between the two methods is that BM25 simply computes tfs, idfs and document length from the whole document – which, according to what you said, is not available in Lucene. That's why I figured that a variant of BM25F would actually be more straightforward to implement.
{quote}

A variant sounds really interesting. I think you know better than I do here; I just looked at the original paper and thought to myself that implementing this "by the book" might not be feasible for a while.

{quote}
Robert, would you be so kind as to have a look at my proposal? It can be found at http://www.google-melange.com/gsoc/proposal/review/google/gsoc2011/davidnemeskey/1. It's basically the same as what I sent to the mailing list. I wrote that I want to implement BM25, BM25F and DFR ("the framework", I meant with one or two smoothing models), as well as to convert the original scoring to the new framework. In light of the thread here, I guess it would be better to modify these goals, perhaps by:
* deleting the conversion part?
* committing myself to BM25/BM25F only?
* explicitly stating that I want a higher level API based on the low-level one?
{quote}

I think you can decide what you want to do. Obviously I would love to see all of it done :)

But it's your choice. I could see you going a couple of different ways:

* closer to your original proposal, you could still develop a flexible scoring API on top of Similarity. Hey, all I did was move stuff from Scorer to Similarity really, which does give flexibility, but it's probably not what an IR researcher would want (it's low-level and confusing).
So you could make a "SimpleSimilarity" or "EasySimilarity" or something that presents a much simpler API (something closer to what Terrier/Indri present) on top of this, for easily implementing ranking functions. I think this would be extremely valuable long-term: who cares if we have a low-level flexible scoring API that only speed demons like, but IR practitioners find confusing and hideous? Someone who is trying to experiment with an enhancement to relevance likely doesn't care if their TREC run takes 30 seconds instead of 20 seconds if the API is really easy and they aren't wasting time fighting with Lucene. If you go this route, you could implement BM25, DFR, etc. as you suggested, as examples of how to use this API, and there would be more of a focus on API quality and simplicity instead of performance.

* or alternatively, you could refine your proposal to implement a really "production strength" version of one of these scoring systems on top of the low-level API, that would ideally have competitive performance/documentation/etc. with Lucene's default scoring today. If you decide to do this, then yes, I would definitely suggest picking only one, because I think it's a ton of work as I listed above, and I think there would be more focus on practical things (some probably being nuances of Lucene) and performance.

> [GSoC] Implementing State of the Art Ranking for Lucene
> -------------------------------------------------------
>
> Key: LUCENE-2959
> URL: https://issues.apache.org/jira/browse/LUCENE-2959
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Examples, Javadocs, Query/Scoring
> Reporter: David Mark Nemeskey
> Labels: gsoc2011, lucene-gsoc-11, mentor
> Attachments: LUCENE-2959_mockdfr.patch, implementation_plan.pdf, proposal.pdf
>
>
> Lucene employs the Vector Space Model (VSM) to rank documents, which compares unfavorably to state of the art algorithms, such as BM25.
> Moreover, the architecture is tailored specifically to VSM, which makes the addition of new ranking functions a non-trivial task.
> This project aims to bring state of the art ranking methods to Lucene and to implement a query architecture with pluggable ranking functions.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira