From: Chris Hostetter
To: general@lucene.apache.org
Date: Wed, 10 Mar 2010 18:02:30 -0800 (PST)
Subject: Re: How to do prefix/phrase matching with term-length-sensitive scoring?
In-Reply-To: <1267630059.30186.2.camel@seraphim>

: Given a list of prefixes, what is the simplest way to match them against
: a text field, giving preference to shorter term matches?

I would suggest using edge n-grams and sorting on a numeric field
containing the "length" of the term.

: * Term frequency within the field must be ignored when scoring.

You can omit term frequency info when indexing (sorting will make it
irrelevant, but there is no reason to waste the space).

: * Documents and fields are sometimes boosted at index time; norms are
: present.

Hmmm, well, that makes the sorting more complicated, but in that case you
can either fold the boost value into your special "length" field, giving
you your own magic number for sorting the results, or use a function
query based approach to meld the (norm-influenced) score with your own
length field.

-Hoss
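
For readers who want to see the idea in miniature, here is a rough, library-free sketch of the approach described above. It is not Lucene code: the class `EdgeNGramIndex`, its method names, and the `boost / length` formula are all hypothetical, chosen only for illustration. It assumes the "length" field holds the total character count of the field's tokens, and that every query prefix must match some token.

```python
from collections import defaultdict


def edge_ngrams(term, min_len=1):
    """Return every leading prefix of term, e.g. 'cat' -> ['c', 'ca', 'cat']."""
    return [term[:n] for n in range(min_len, len(term) + 1)]


class EdgeNGramIndex:
    """Toy in-memory index (hypothetical, for illustration only)."""

    def __init__(self):
        # gram -> set of doc ids; a set keeps no term frequency information
        self.postings = defaultdict(set)
        self.length = {}  # doc id -> numeric "length" field
        self.boost = {}   # doc id -> index-time boost

    def add(self, doc_id, text, boost=1.0):
        tokens = text.lower().split()
        for token in tokens:
            for gram in edge_ngrams(token):
                self.postings[gram].add(doc_id)
        # Assumption: "length" is the total number of token characters.
        self.length[doc_id] = sum(len(t) for t in tokens)
        self.boost[doc_id] = boost

    def match(self, prefixes):
        """Doc ids where every prefix matches at least one token."""
        sets = [self.postings.get(p.lower(), set()) for p in prefixes]
        return set.intersection(*sets) if sets else set()

    def search_by_length(self, prefixes):
        """Plain sort on the length field, shortest first."""
        return sorted(self.match(prefixes), key=lambda d: (self.length[d], d))

    def search_boosted(self, prefixes):
        """Sort on a made-up 'magic number': boost / length, highest first."""
        return sorted(
            self.match(prefixes),
            key=lambda d: (-self.boost[d] / self.length[d], d),
        )


index = EdgeNGramIndex()
index.add("a", "cat")
index.add("b", "catalog")
index.add("c", "catalog cat cat")
index.add("d", "caterpillar", boost=4.0)
print(index.search_by_length(["cat"]))
print(index.search_boosted(["cat"]))
```

Document "c" repeats "cat" but gains nothing from it, since only set membership is stored. In the boosted ordering the heavily boosted "caterpillar" document can overtake shorter matches; how strongly boost should outweigh length is a tuning decision the sketch does not try to settle.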