Date: Wed, 4 Jul 2007 14:05:17 -0400 (EDT)
From: Tim Sturge
To: java-user@lucene.apache.org
Cc: hossman_lucene@fucit.org
Subject: Re: product based term combination for BooleanQuery?

:-)

The use of wikipedia data here is no secret; it's all over www.freebase.com. I just hoped to avoid being sucked into a "what is the best way to index wikipedia with Lucene?" discussion, which I believe several other groups are already tackling.

At index time, I used a per-document boost (over all fields) and a per-field boost (over all documents). I can certainly factor out the first into a query boost, but I was under the impression that if I ever wanted to combine fields (e.g., to index all "name", "alias" and "title" data in a single "head" field) then I had to pre-boost the data prior to combining it.

I tend to believe that these (short) fields contain more relevant information than (long) wikipedia articles or other documents. Should idf and tf take care of that short/long quality distinction? It sounds like you feel they should.

I'll build an index without the per-field boost and see if that produces improved results.

Thanks,

Tim

----- Original Message -----
From: "Chris Hostetter"
To: "Lucene Users"
Sent: Tuesday, July 3, 2007 10:26:57 PM (GMT-0800) America/Los_Angeles
Subject: Re: product based term combination for BooleanQuery?

(side note: if you are going to try and obfuscate your field names when sending explain output so we don't know you are using wikipedia data (not that we care), please at least be consistent about it so the final explanations actually make sense -- it will save everyone a lot of confusion and help us help you)

the biggest factor in your scores seems to be the fieldNorms for your name, title and alias fields ... they are so high that tf and idf are pretty much irrelevant. By the looks of it, when you were indexing your docs, you used a consistent field boost per field on every instance of that field for every document ... this is really not a use case where index time field (or document) boosts make sense.

in my opinion the number one thing you can do to improve your relevancy right now is to stop using index time boosts and use query boosts instead. If you don't want to reindex completely, the LengthNormModifier class (in the misc contrib) can update all of your norms in place without reindexing and throw away any index time boosts you had.

-Hoss

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
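
To make the point about fieldNorms concrete, here is a minimal sketch of the arithmetic, assuming the classic DefaultSimilarity-style formulas of that Lucene era (tf = sqrt(freq), idf = 1 + ln(numDocs / (docFreq + 1)), lengthNorm = 1 / sqrt(numTerms), fieldNorm = docBoost * fieldBoost * lengthNorm). It does not call Lucene at all; the class NormSketch, its method names, and every number in main are hypothetical, and it ignores queryNorm, coord, and the lossy one-byte encoding that real norms go through. It only illustrates that a constant per-field boost can be moved to query time without changing the per-term score, and that lengthNorm by itself already favours short fields.

```java
public class NormSketch {

    // Term-frequency factor: square root of the raw frequency (assumed formula).
    public static double tf(int freq) {
        return Math.sqrt(freq);
    }

    // Inverse document frequency (assumed formula).
    public static double idf(int docFreq, int numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    // Length normalisation: shorter fields get a larger value (assumed formula).
    public static double lengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    // What ends up in the stored norm before byte encoding:
    // document boost * field boost * length norm.
    public static double fieldNorm(double docBoost, double fieldBoost, int numTerms) {
        return docBoost * fieldBoost * lengthNorm(numTerms);
    }

    // Per-term score contribution, leaving out queryNorm and coord.
    public static double termScore(int freq, int docFreq, int numDocs,
                                   double queryBoost, double fieldNorm) {
        double w = idf(docFreq, numDocs);
        return tf(freq) * w * w * queryBoost * fieldNorm;
    }

    public static void main(String[] args) {
        // Hypothetical numbers: a 4-term "name" field and a 2000-term article body.
        int shortField = 4;
        int longField = 2000;
        System.out.println("lengthNorm(short) = " + lengthNorm(shortField));
        System.out.println("lengthNorm(long)  = " + lengthNorm(longField));

        // Same constant boost of 8, applied once at index time and once at query time.
        double indexTime = termScore(1, 9, 1000, 1.0, fieldNorm(1.0, 8.0, shortField));
        double queryTime = termScore(1, 9, 1000, 8.0, fieldNorm(1.0, 1.0, shortField));
        System.out.println("index-time boost score = " + indexTime);
        System.out.println("query-time boost score = " + queryTime);
    }
}
```

Under these assumptions the two boosted scores agree, so the query-time version loses nothing while leaving the stored norm free to carry only the length information, and the boost can then be tuned without reindexing.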