lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Shai Erera <ser...@gmail.com>
Subject Re: Is there way to get complete start end matches to be first in the list ?
Date Wed, 09 Sep 2009 03:44:45 GMT
I can think of a way where you rely solely on scores and therefore there is
still chance to get results not ordered the way you want, but you can try it
- run the query [foo bar OR "foo bar"^10]. That way, your first result
should be scored by [foo], [bar] and ["foo bar"]. Also, the phrase is added
a score modifier so that it will score 10 times more than regular matches.
You can change it to 2, 5, 100 whichever works for you.

There is still some remote chance that the above approach will rank a "bar
foo" higher than "foo bar" but it depends on your content. If you don't
think this can happen in your content, then I'd try to above query.

Shai

On Wed, Sep 9, 2009 at 1:30 AM, Paul Taylor <paul_t100@fastmail.fm> wrote:

> Michael Barbarelli wrote:
>
>>
>> What I do is run each entry in the hits collection through a home-rolled
>> levenstein distance algorithm to obtain a score. Then I sort by score.
>>
>>  On Sep 8, 2009 9:44 PM, "Paul Taylor" <paul_t100@fastmail.fm <mailto:
>>> paul_t100@fastmail.fm>> wrote:
>>>
>>> Is there way to get complete start end matches to be first in the list
>>>
>>> We use Lucene to search song albums titles typically one to ten words
>>> long. If the user enter something like 'foo bar' everything that contains
>>> foo bar is returned with max score , thats fine but it would be better if an
>>> exact match is right at the top. Also although an OR Search has been entered
>>> would be great if that it ranked matches where both words are together
>>> higher than when they are not , but still return results that only match one
>>> condirtion.
>>>
>>> Ideally giving results in this order
>>>
>>>  * Foo Bar (exact match)
>>>  * The Foo Bar Somethings (substring - exact match)
>>>  * Bar Foo (all terms match)
>>>  * Bar Baz and the Foo (substring - all terms match)
>>>  * Foo (some terms match)
>>>  * Foo Something (substring - some terms match)
>>>
>>>
>>> Is there something I can do in Lucene, or some way I can modify the query
>>> (as entered by the user) to get results better aproaching this
>>>
>>>
>>> Paul
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org <mailto:
>>> java-user-unsubscribe@lucene.apache.org>
>>> For additional commands, e-mail: java-user-help@lucene.apache.org<mailto:
>>> java-user-help@lucene.apache.org>
>>>
>>>  Thats sounds like the right algorithm but cannot this be done within
> Lucene. The trouble is say I get a 1000 hits, I only want the first 10 but
> if I openly apply the algorithm to the first ten it might miss out on the
> 11th which should really be the 5th, but if have to get all 1000 docs and
> apply algorithm its going to be a bit of an overhead.
>
> Code excerpt might make it clearer:
>       TopScoreDocCollector collector = TopScoreDocCollector.create(offset +
> limit, true);
>       searcher.search(parser.parse(query), collector);
>       Results results = new Results();
>       TopDocs topDocs = collector.topDocs();
>       results.offset = offset;
>       results.totalHits = topDocs.totalHits;
>       ScoreDoc docs[] = topDocs.scoreDocs;
>       float maxScore = topDocs.getMaxScore();
>       for (int i = offset; i < docs.length; i++) {
>           Result result = new Result();
>           result.score = docs[i].score / maxScore;
>           result.doc = new MbDocument(searcher.doc(docs[i].doc));
>           results.results.add(result);
>       }
>       return results;
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message