Mailing-List: contact java-user-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-user@lucene.apache.org
Received-SPF: pass (nike.apache.org: domain of paul_t100@fastmail.fm
 designates 66.111.4.25 as permitted sender)
Message-ID: <4AA6DB1E.2040107@fastmail.fm>
Date: Tue, 08 Sep 2009 23:30:54 +0100
From: Paul Taylor <paul_t100@fastmail.fm>
Reply-To: paul_t100@fastmail.fm
User-Agent: Thunderbird 2.0.0.23 (Macintosh/20090812)
MIME-Version: 1.0
To: Michael Barbarelli <mbarbarelli@gmail.com>
CC: java-user@lucene.apache.org
Subject: Re: Is there way to get complete start end matches to be first in
 the 	list ?
References: <4AA6C202.8090406@fastmail.fm>
	 <a258c0620909081347n6a96d02fxa5994815f5515418@mail.gmail.com>
 <a258c0620909081351r27f46e07k700871454c1e1eed@mail.gmail.com>
In-Reply-To: <a258c0620909081351r27f46e07k700871454c1e1eed@mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit

Michael Barbarelli wrote:
>
> What I do is run each entry in the hits collection through a
> home-rolled Levenshtein distance algorithm to obtain a score. Then I
> sort by score.
>
>> On Sep 8, 2009 9:44 PM, "Paul Taylor" <paul_t100@fastmail.fm> wrote:
>>
>> Is there a way to get complete start-to-end matches to be first in the list?
>>
>> We use Lucene to search song album titles, typically one to ten words
>> long. If the user enters something like 'foo bar', everything that
>> contains foo bar is returned with max score. That's fine, but it would
>> be better if an exact match were right at the top. Also, although an OR
>> search has been entered, it would be great if it ranked matches
>> where both words are together higher than when they are not, but
>> still returned results that only match one condition.
>>
>> Ideally giving results in this order
>>
>>   * Foo Bar (exact match)
>>   * The Foo Bar Somethings (substring - exact match)
>>   * Bar Foo (all terms match)
>>   * Bar Baz and the Foo (substring - all terms match)
>>   * Foo (some terms match)
>>   * Foo Something (substring - some terms match)
>>
>>
>> Is there something I can do in Lucene, or some way I can modify the
>> query (as entered by the user), to get results closer to this?
>>
>>
>> Paul
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
That sounds like the right algorithm, but can't this be done within
Lucene? The trouble is, say I get 1000 hits and I only want the first 10.
If I apply the algorithm only to the first ten, it might miss the 11th,
which should really be the 5th. But if I have to fetch all 1000 docs and
apply the algorithm to each, that is going to be a fair bit of overhead.
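
One middle ground would be to over-fetch a window somewhat larger than the page (say 50 or 100 hits rather than 10 or all 1000), re-rank only that window by edit distance, and then cut the page out of it. Below is a minimal sketch of that idea in plain Java (Java 8 or later). TitleReranker, levenshtein and rerank are hypothetical names of mine, not Lucene API, and the window is assumed to be a list of titles already in Lucene score order. Plain edit distance will not reproduce the ideal order exactly: it puts 'Foo' ahead of 'The Foo Bar Somethings', for example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical helper, not part of Lucene: re-ranks an over-fetched window of titles. */
final class TitleReranker {

    /** Two-row dynamic-programming Levenshtein edit distance. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // cost of building b's prefix from the empty string
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // cost of deleting the first i characters of a
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /**
     * Sorts the window by case-insensitive edit distance to the query and
     * returns the page [offset, offset + limit). List.sort is stable, so
     * titles with equal distance keep their incoming (score) order.
     */
    static List<String> rerank(String query, List<String> window, int offset, int limit) {
        String q = query.toLowerCase();
        List<String> sorted = new ArrayList<>(window);
        sorted.sort(Comparator.comparingInt((String t) -> levenshtein(q, t.toLowerCase())));
        int from = Math.min(offset, sorted.size());
        int to = Math.min(offset + limit, sorted.size());
        return new ArrayList<>(sorted.subList(from, to));
    }
}
```

The window size is a trade-off I have not measured: a hit that Lucene ranks outside the window can still never reach the page, so this only narrows the problem rather than removing it.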

Here is the current search code, which might make it clearer:
        // Collect only the top (offset + limit) hits.
        TopScoreDocCollector collector =
                TopScoreDocCollector.create(offset + limit, true);
        searcher.search(parser.parse(query), collector);
        Results results = new Results();
        TopDocs topDocs = collector.topDocs();
        results.offset = offset;
        results.totalHits = topDocs.totalHits;
        ScoreDoc[] docs = topDocs.scoreDocs;
        // Normalise scores so that the best hit scores 1.0.
        float maxScore = topDocs.getMaxScore();
        // Skip the first `offset` hits; the rest form the requested page.
        for (int i = offset; i < docs.length; i++) {
            Result result = new Result();
            result.score = docs[i].score / maxScore;
            result.doc = new MbDocument(searcher.doc(docs[i].doc));
            results.results.add(result);
        }
        return results;
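
The other direction I can think of is modifying the query before it reaches the parser, so that Lucene's own scoring does the ordering and only offset + limit docs are ever collected. A rough sketch, assuming the parser's default operator is OR and that the input is plain words with no query-syntax characters (anything else would need escaping first, e.g. with QueryParser.escape). BoostedQueryBuilder and the boost values are hypothetical, not something Lucene provides or values I have tuned:

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: expands plain user input into a boosted query string. */
final class BoostedQueryBuilder {

    /**
     * For input "foo bar" this produces three optional clauses:
     * the exact phrase (highest boost), all terms required (medium boost),
     * and the bare terms, so single-term matches are still returned.
     */
    static String build(String userInput, int phraseBoost, int allTermsBoost) {
        List<String> terms = new ArrayList<>();
        for (String t : userInput.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        String plain = String.join(" ", terms);
        if (terms.size() < 2) {
            return plain; // a single term has no phrase or "all terms" variant
        }
        String phrase = "\"" + plain + "\"";
        String allRequired = "(+" + String.join(" +", terms) + ")";
        return phrase + "^" + phraseBoost + " "
                + allRequired + "^" + allTermsBoost + " "
                + plain;
    }
}
```

With that, build("foo bar", 4, 2) gives the string "foo bar"^4 (+foo +bar)^2 foo bar, which could be passed to parser.parse() in place of the raw query. Whether 'Foo Bar' then lands above 'The Foo Bar Somethings' would depend on length normalisation on the title field, so I would treat this as a starting point to experiment with rather than a guaranteed ordering.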
