Message-ID: <4AA6DB1E.2040107@fastmail.fm>
Date: Tue, 08 Sep 2009 23:30:54 +0100
From: Paul Taylor
Reply-To: paul_t100@fastmail.fm
To: Michael Barbarelli
CC: java-user@lucene.apache.org
Subject: Re: Is there way to get complete start end matches to be first in the list ?
References: <4AA6C202.8090406@fastmail.fm>

Michael Barbarelli wrote:
>
> What I do is run each entry in the hits collection through a
> home-rolled Levenshtein distance algorithm to obtain a score. Then I
> sort by score.
>
>> On Sep 8, 2009 9:44 PM, "Paul Taylor" wrote:
>>
>> Is there a way to get complete start-to-end matches to be first in
>> the list?
>>
>> We use Lucene to search song album titles, typically one to ten words
>> long. If the user enters something like 'foo bar', everything that
>> contains foo bar is returned with max score. That's fine, but it would
>> be better if an exact match were right at the top. Also, although an
>> OR search has been entered, it would be great if it ranked matches
>> where both words are together higher than when they are not, but
>> still returned results that only match one condition.
>>
>> Ideally giving results in this order:
>>
>> * Foo Bar (exact match)
>> * The Foo Bar Somethings (substring - exact match)
>> * Bar Foo (all terms match)
>> * Bar Baz and the Foo (substring - all terms match)
>> * Foo (some terms match)
>> * Foo Something (substring - some terms match)
>>
>> Is there something I can do in Lucene, or some way I can modify the
>> query (as entered by the user), to get results better approaching
>> this?
>>
>> Paul
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org

That sounds like the right algorithm, but can't this be done within
Lucene? The trouble is, say I get 1000 hits and only want the first 10.
If I only apply the algorithm to the first ten, it might miss the 11th,
which should really be the 5th. But if I have to get all 1000 docs and
apply the algorithm to each, it's going to be a bit of an overhead.

A code excerpt might make it clearer:

    // Collect only as many hits as are needed to fill the requested page.
    TopScoreDocCollector collector = TopScoreDocCollector.create(offset + limit, true);
    searcher.search(parser.parse(query), collector);

    Results results = new Results();
    TopDocs topDocs = collector.topDocs();
    results.offset = offset;
    results.totalHits = topDocs.totalHits;
    ScoreDoc[] docs = topDocs.scoreDocs;
    float maxScore = topDocs.getMaxScore();
    // Skip the first 'offset' hits and normalise each score against the best one.
    for (int i = offset; i < docs.length; i++) {
        Result result = new Result();
        result.score = docs[i].score / maxScore;
        result.doc = new MbDocument(searcher.doc(docs[i].doc));
        results.results.add(result);
    }
    return results;
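
One possible middle ground between re-ranking only the requested page
and re-ranking every hit is to over-fetch a fixed window (say the top
100 by Lucene score), re-rank that window by edit distance to the
query, and only then cut out the page. The sketch below illustrates the
idea on plain strings. The class WindowedRerank and its methods are
hypothetical names, not Lucene API, and the list of titles stands in
for whatever title field would be loaded from the collected docs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper (not part of Lucene): re-ranks an over-fetched window of titles. */
final class WindowedRerank {

    /** Levenshtein edit distance, computed with two rolling rows. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /**
     * Re-ranks 'window' (assumed to be in Lucene score order) by edit distance
     * to the query, then returns the page [offset, offset + limit).
     */
    static List<String> rerank(String query, List<String> window, int offset, int limit) {
        final String q = query.toLowerCase(Locale.ROOT);
        List<String> sorted = new ArrayList<>(window);
        // List.sort is stable, so titles with equal distance keep their Lucene order.
        sorted.sort(Comparator.comparingInt(
                (String title) -> levenshtein(q, title.toLowerCase(Locale.ROOT))));
        int from = Math.min(offset, sorted.size());
        int to = Math.min(offset + limit, sorted.size());
        return new ArrayList<>(sorted.subList(from, to));
    }

    public static void main(String[] args) {
        // Made-up window, in the order a search might have returned it.
        List<String> window = Arrays.asList(
                "The Foo Bar Somethings", "Bar Foo", "Foo Bar", "Foo");
        System.out.println(rerank("foo bar", window, 0, 2));
    }
}
```

The window size is a trade-off: a hit ranked below the window by Lucene
can still never be promoted, it just becomes less likely. Note also
that plain Levenshtein distance favours short titles, so "Foo" would
sort ahead of "The Foo Bar Somethings"; it only approximates the order
asked for earlier in the thread.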