lucene-java-user mailing list archives

From Jacek Grzebyta <grzebyta....@gmail.com>
Subject Re: Penalize fact the searched term is within a world
Date Fri, 09 Jun 2017 11:38:56 GMT
Hi Ahmet,

That works! I still do not understand how it all works, though. I only know
that an analyser cuts the indexed text into tokens, but I do not know how the
matching is done.

Can you recommend a good book to read? I would prefer something with less
maths and more examples.
The only one I have found is the free "An Introduction to Information
Retrieval", but it has a lot of maths that I do not understand.

Best regards,
Jacek



On 8 June 2017 at 19:36, Ahmet Arslan <iorixxx@yahoo.com.invalid> wrote:

> Hi,
> You can completely ban within-a-word search by simply using
> WhitespaceTokenizer, for example. By the way, it is all about how you
> tokenize/analyze your text. Once you have decided, you can create two
> versions of a single field using different analysers. This allows you to
> assign different weights to those fields at query time.
> Ahmet
>
>
> On Thursday, June 8, 2017, 2:56:37 PM GMT+3, Jacek Grzebyta <
> grzebyta.dev@gmail.com> wrote:
>
>
> Hi,
>
> Apologies for repeating a question from the IRC room, but I am not sure
> whether that room is still alive.
>
> I have no idea how Lucene works, but I need to modify a part of the
> rdf4j project that depends on it.
>
> I need to use Lucene to create a mapping file based on text searching, and I
> found the following problem. Let's take a term 'abcd' which is mapped
> to node 'abcd-2' even though node 'abcd' exists. The issue is that Lucene
> searches for the term, finds it in both nodes 'abcd' and 'abcd-2', and gives
> them the same score. My question is: how can I modify the scoring to penalise
> the case where the searched term is part of a longer word, and give a higher
> score when the term is a word by itself?
>
> Visually it looks like this:
>
> node 'abcd':
>   - name: abcd
>
> total score = LS /lucene score/ * 2.0 /name weight/
>
>
>
> node 'abcd-2':
>   - name: abcd-2
>   - alias1: abcd-h
>   - alias2: abcd-k9
>
> total score = LS * 2.0 + LS * 0.5 /alias1 score/ + LS * 0.1 /alias2 score/
>
> I gave different weights to the properties. "Name" has the highest weight,
> but "alias" has some small weight as well. The total score for a node is
> the sum of each partial score * weight. In the end 'abcd-2' gets a higher
> score than 'abcd'.
>
> thanks,
> Jacek
>
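
The arithmetic in the quoted question, and the two-field idea from Ahmet's
reply, can be sketched in plain Python. This is only an illustration and not
Lucene code: the two tokenizer functions are rough stand-ins for a tokenizer
that breaks on punctuation and for WhitespaceTokenizer, LS is assumed to be
1.0 for every matching field, and the names node_score, combined_score,
exact_boost and partial_boost are made up for this example.

```python
import re

# Field weights and node contents copied from the quoted example.
WEIGHTS = {"name": 2.0, "alias1": 0.5, "alias2": 0.1}
NODES = {
    "abcd": {"name": "abcd"},
    "abcd-2": {"name": "abcd-2", "alias1": "abcd-h", "alias2": "abcd-k9"},
}


def split_on_punctuation(text):
    """Stand-in for a tokenizer that breaks on '-': 'abcd-2' -> ['abcd', '2']."""
    return [t for t in re.split(r"[^0-9A-Za-z]+", text.lower()) if t]


def split_on_whitespace(text):
    """Stand-in for a whitespace tokenizer: 'abcd-2' stays a single token."""
    return text.lower().split()


def node_score(fields, term, tokenize, ls=1.0):
    """Sum LS * weight over every field whose token list contains the term."""
    total = 0.0
    for field, value in fields.items():
        if term in tokenize(value):
            total += ls * WEIGHTS[field]
    return total


def combined_score(fields, term, exact_boost=4.0, partial_boost=1.0):
    """Two 'versions' of each field: whole-word matches get the larger boost."""
    exact = node_score(fields, term, split_on_whitespace)
    partial = node_score(fields, term, split_on_punctuation)
    return exact_boost * exact + partial_boost * partial


if __name__ == "__main__":
    for node_id, fields in NODES.items():
        print(
            node_id,
            node_score(fields, "abcd", split_on_punctuation),
            node_score(fields, "abcd", split_on_whitespace),
            combined_score(fields, "abcd"),
        )
```

With the punctuation-splitting stand-in, every field of 'abcd-2' yields the
token 'abcd', so its weights add up and it outranks node 'abcd', which is the
behaviour described in the question. With the whitespace stand-in only node
'abcd' matches at all. Combining both, with a larger boost on the whole-word
version, keeps 'abcd-2' as a lower-ranked hit instead of dropping it. The
boost values here are arbitrary and would need tuning on real data.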
