Mailing-List: contact java-user-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-user@lucene.apache.org
MIME-Version: 1.0
In-Reply-To: <00df01d2e121$a8b94450$fa2bccf0$@thetaphi.de>
References: <CAH+cQJno0a8aVE8PSQ0ogVidh7dFFdxJq8=BoOA_-aCU5zx0mg@mail.gmail.com>
 <2011949521.5580400.1496947013657@mail.yahoo.com> <CAH+cQJkk8V4C_-e=NmmyHkccafMKZafDHmVGxh33oKKv=N+n8A@mail.gmail.com>
 <00df01d2e121$a8b94450$fa2bccf0$@thetaphi.de>
From: Jacek Grzebyta <grzebyta.dev@gmail.com>
Date: Fri, 9 Jun 2017 14:19:36 +0100
Message-ID: <CAH+cQJn5zfn62j6DqB9P_5K_SPwFKp5s66xez8DxUD7cCsgDzA@mail.gmail.com>
Subject: Re: Penalize fact the searched term is within a world
To: java-user@lucene.apache.org
Content-Type: multipart/alternative; boundary="001a1149818e27c4b4055186d2bd"
archived-at: Fri, 09 Jun 2017 13:19:44 -0000

--001a1149818e27c4b4055186d2bd
Content-Type: text/plain; charset="UTF-8"

Unfortunately, WhitespaceTokenizer does not work properly on the real data.
I also tried KeywordAnalyzer, because the data I need to process are just
IDs, but with that I get no output at all.
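
To make the matching model described in the reply quoted below concrete,
here is a minimal sketch in plain Java. It does not use Lucene at all: the
class TokenMatchSketch and its two toy tokenizers are hypothetical stand-ins
(one splits on whitespace only, the other splits on every non-alphanumeric
character and lowercases), and real Lucene tokenizers follow their own
rules. It only illustrates that a match needs identical tokens on the index
side and the query side.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class TokenMatchSketch {

    // Toy whitespace analysis: split on runs of whitespace, keep tokens unchanged.
    public static List<String> whitespaceTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Toy word analysis (an assumption, not a Lucene tokenizer):
    // split on anything that is not a letter or digit, then lowercase.
    public static List<String> wordTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // A match means some query token is exactly equal to some indexed token.
    public static boolean matches(List<String> indexed, List<String> query) {
        for (String q : query) {
            if (indexed.contains(q)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Word-style splitting turns "abcd-2" into "abcd" and "2", so "abcd" matches.
        System.out.println(matches(wordTokens("abcd-2"), wordTokens("abcd")));
        // Whitespace splitting keeps "abcd-2" whole, so the query "abcd" does not match.
        System.out.println(matches(whitespaceTokens("abcd-2"), whitespaceTokens("abcd")));
        // The exact ID still matches itself under whitespace splitting.
        System.out.println(matches(whitespaceTokens("abcd"), whitespaceTokens("abcd")));
    }
}
```

Under these assumptions the same analysis is applied to both the indexed
text and the query, which is why changing the tokenizer changes which nodes
are found at all.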


On 9 June 2017 at 14:09, Uwe Schindler <uwe@thetaphi.de> wrote:

> Hi,
>
> the tokens are matched as is. It is only a match if the tokens are exactly
> the same bytes. Substring matches are never done, just a simple
> comparison of bytes.
>
> To get fuzzier matches, you have to do text analysis right. This
> includes splitting the text into tokens (Tokenizer), but also term
> "normalization" (TokenFilters). One example is lowercasing (to allow
> case-insensitive matching), but stemming might also be done, or conversion
> to phonetic codes (to allow phonetic matches). The tokens produced no
> longer need to be "human readable". How does this work with matching, when
> the user won't enter phonetic codes? Tokenization and normalization are done
> on both the indexing side and the query side. If both sides produce the
> same tokens, it's a match, very simple. With that information you should be
> able to think about good ways to analyze the text for your use case. If you
> use Solr, the schema.xml is your friend. In Lucene, look at the analysis
> module, which has examples for common languages. If you want to build your
> own, use CustomAnalyzer to create your own combination of tokenization and
> normalization (filtering of tokens).
>
> Uwe
>
> -----
> Uwe Schindler
> Achterdiek 19, D-28357 Bremen
> http://www.thetaphi.de
> eMail: uwe@thetaphi.de
>
> > -----Original Message-----
> > From: Jacek Grzebyta [mailto:grzebyta.dev@gmail.com]
> > Sent: Friday, June 9, 2017 1:39 PM
> > To: java-user@lucene.apache.org
> > Subject: Re: Penalize fact the searched term is within a world
> >
> > Hi Ahmet,
> >
> > That works! Still, I do not understand how that stuff works. I just know
> > that the analyser cuts the indexed text into tokens. But I do not know how
> > the matching is done.
> >
> > Can you recommend a good book to read? I would prefer something with less
> > maths and more examples.
> > The only one I found is the free "An Introduction to Information
> > Retrieval", but it has a lot of maths I do not understand.
> >
> > Best regards,
> > Jacek
> >
> >
> >
> > On 8 June 2017 at 19:36, Ahmet Arslan <iorixxx@yahoo.com.invalid> wrote:
> >
> > > Hi,
> > > You can completely ban within-a-word search by simply using
> > > WhitespaceTokenizer, for example. By the way, it is all about how you
> > > tokenize/analyze your text. Once you have decided, you can create two
> > > versions of a single field using different analysers. This allows you
> > > to assign different weights to those fields at query time.
> > > Ahmet
> > >
> > >
> > > On Thursday, June 8, 2017, 2:56:37 PM GMT+3, Jacek Grzebyta <
> > > grzebyta.dev@gmail.com> wrote:
> > >
> > >
> > > Hi,
> > >
> > > Apologies for repeating a question from the IRC room, but I am not sure
> > > if that is alive.
> > >
> > > I have no idea how Lucene works, but I need to modify a part of the
> > > rdf4j project that depends on it.
> > >
> > > I need to use Lucene to create a mapping file based on text searching,
> > > and I found the following problem. Let's take a term 'abcd' which is
> > > mapped to node 'abcd-2' whereas node 'abcd' exists. I found the issue is
> > > that Lucene searches for the term, finds it in both nodes 'abcd' and
> > > 'abcd-2', and gives them the same score. My question is: how do I modify
> > > the scoring to penalise the fact that the searched term is part of a
> > > longer word, and give a higher score if it is itself a word?
> > >
> > > Visually it looks like this:
> > >
> > > node 'abcd':
> > >   - name: abcd
> > >
> > > total score = LS /lucene score/ * 2.0 /name weight/
> > >
> > >
> > >
> > > node 'abcd-2':
> > >   - name: abcd-2
> > >   - alias1: abcd-h
> > >   - alias2: abcd-k9
> > >
> > > total score = LS * 2.0 + LS * 0.5 /alias1 score/ + LS * 0.1 /alias2 score/
> > >
> > > I gave different weights to the properties. "Name" has the highest
> > > weight, but "alias" has some small weight as well. The total score for a
> > > node is the sum of each partial score * weight. As a result, 'abcd-2'
> > > gets a higher score than 'abcd'.
> > >
> > > thanks,
> > > Jacek
> > >

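The weighted sum from the original question quoted above can be worked
through with a small plain-Java sketch. The class NodeScoreSketch and the
helper totalScore are hypothetical names, and LS = 1.0 is an assumed
placeholder value; the only facts taken from the thread are the weights
(2.0, 0.5, 0.1) and that every match receives the same Lucene score.

```java
public class NodeScoreSketch {

    // Total node score: sum of (Lucene score * property weight) over matching properties.
    public static double totalScore(double luceneScore, double... weights) {
        double total = 0.0;
        for (double w : weights) {
            total += luceneScore * w;
        }
        return total;
    }

    public static void main(String[] args) {
        double ls = 1.0; // assumed placeholder; the same LS is used for every match

        // Node 'abcd': only "name" matches (weight 2.0).
        double abcd = totalScore(ls, 2.0);

        // Node 'abcd-2': "name" (2.0), "alias1" (0.5) and "alias2" (0.1) all match.
        double abcd2 = totalScore(ls, 2.0, 0.5, 0.1);

        System.out.println("abcd   = " + abcd);
        System.out.println("abcd-2 = " + abcd2);
        System.out.println("abcd-2 ranks higher: " + (abcd2 > abcd));
    }
}
```

With equal LS for every match, the extra alias matches always push 'abcd-2'
above 'abcd', so under these assumptions the ranking can only change if the
partial matches stop matching or receive a lower LS.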
--001a1149818e27c4b4055186d2bd--