Subject: Re: BM25 model for solr 4?
From: Tom Burton-West
To: solr-user@lucene.apache.org
Date: Thu, 15 Nov 2012 14:04:12 -0500

Hello Floyd,

There is a ton of research literature out there comparing BM25 to the vector space model, but you have to be careful interpreting it.

BM25 originally beat the SMART vector space model in the early TRECs because it did better tf and length normalization. Pivoted document length normalization was invented to get the vector space model to catch up to BM25. (Just Google for "Singhal length normalization". Amit Singhal, now chief of Google Search, did his doctoral thesis on this and it is available. Similarly, Stephen Robertson, now at Microsoft Research, published a ton of studies of BM25.)

The default Solr/Lucene similarity class doesn't provide the length or tf normalization tuning parameters that BM25 does. There is the sweetspot similarity, but that doesn't quite work the same way that the BM25 normalizations do.

Document length normalization needs and parameter tuning all depend on your data. So if you are reading a comparison, you need to determine:

1) When comparing recall/precision etc. between vector space and BM25, did the experimenter tune both the vector space and the BM25 parameters?
2) Are the documents (and queries) used in the test similar in length characteristics to your documents and queries?
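To make the tf and length normalization knobs mentioned above concrete, here is a minimal sketch of a per-term BM25 score in Python. It is not Solr's or Lucene's implementation: the function name `bm25_term_score` is hypothetical, the IDF form is one common variant, and the defaults k1=1.2 and b=0.75 are simply the commonly cited starting values, assumed here for illustration.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, df, num_docs, k1=1.2, b=0.75):
    """Score one query term in one document with a common BM25 variant.

    tf: term frequency in the document
    doc_len / avg_doc_len: document length and collection average length
    df / num_docs: document frequency of the term and collection size
    k1: tf saturation parameter (assumed default 1.2)
    b: length normalization strength, 0 = none, 1 = full (assumed default 0.75)
    """
    # IDF with 1 added inside the log so the value is never negative.
    idf = math.log(1.0 + (num_docs - df + 0.5) / (df + 0.5))
    # Length normalization factor: 1.0 for an average-length document.
    norm = 1.0 - b + b * (doc_len / avg_doc_len)
    # tf saturation: the score approaches idf * (k1 + 1) as tf grows.
    return idf * (tf * (k1 + 1.0)) / (tf + k1 * norm)


# Same tf, very different document lengths.
short_doc = bm25_term_score(tf=3, doc_len=100, avg_doc_len=1000, df=10, num_docs=1000)
long_doc = bm25_term_score(tf=3, doc_len=100000, avg_doc_len=1000, df=10, num_docs=1000)
print(short_doc > long_doc)  # the long document is penalized when b > 0
```

With b set to 0 the two documents above score identically, and raising k1 lets repeated occurrences of a term keep adding to the score for longer. Those are the two parameters a fair comparison would need to tune for a given collection, especially one with very long documents.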
We are planning to do some testing in the next few months for our use case, which is 10 million books where we index the entire book. These are extremely long documents compared to most IR research.

I'd love to hear about actual (non-research) production implementations that have tested the new ranking models available in Solr.

Tom

On Wed, Nov 14, 2012 at 9:16 PM, Floyd Wu wrote:
> Hi there,
> Does anybody can kindly tell me how to setup solr to use BM25?
> By the way, are there any experiment or research shows BM25 and classical
> VSM model comparison in recall/precision rate?
>
> Thanks in advanced.