lucene-java-user mailing list archives

From "Uwe Schindler" <...@thetaphi.de>
Subject RE: Penalize fact the searched term is within a world
Date Fri, 09 Jun 2017 13:09:42 GMT
Hi,

Tokens are matched as is. It is only a match if the tokens are exactly the same bytes.
No substring matching is ever done, just a simple comparison of bytes.

To get fuzzier matches, you have to do the text analysis right. This includes splitting
the text into tokens (Tokenizer), but also term "normalization" (TokenFilters). One example is
lowercasing (to allow case-insensitive matching), but stemming might also be done, or conversion
to phonetic codes (to allow phonetic matches). The resulting tokens do not necessarily need to
be "human readable" anymore. How does this work with matching, given that the user won't enter
phonetic codes? Tokenization and normalization are done on both the indexing side and the
query side. If both sides produce the same tokens, it's a match, very simple. With that
information you should be able to think about good ways to analyze the text for your use case.
If you use Solr, the schema.xml is your friend. In Lucene, look at the analysis module, which has
examples for common languages. If you want to build your own, use CustomAnalyzer to create your
own combination of tokenization and normalization (filtering of tokens).
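
The idea that both sides run the same analysis and then compare tokens for equality
can be sketched in plain Java. This is a toy illustration only, not Lucene's API: the
class ToyAnalysis and its methods are made-up names, and the two "analyzers" are just
regex splits followed by lowercasing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Toy illustration only (hypothetical names, NOT Lucene classes).
class ToyAnalysis {

    // "Tokenizer" + "TokenFilter": split on the given regex, then lowercase.
    static List<String> analyze(String text, String delimiterRegex) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.split(delimiterRegex)) {
            if (!raw.isEmpty()) {
                tokens.add(raw.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // Splits on whitespace only, so "abcd-2" stays a single token.
    static List<String> whitespaceAnalyze(String text) {
        return analyze(text, "\\s+");
    }

    // Splits on anything that is not a letter or digit, so "abcd-2" becomes "abcd", "2".
    static List<String> wordAnalyze(String text) {
        return analyze(text, "[^\\p{L}\\p{N}]+");
    }

    // A match means some query token is exactly equal to some indexed token.
    // There is no substring comparison anywhere.
    static boolean matches(List<String> indexedTokens, List<String> queryTokens) {
        for (String q : queryTokens) {
            if (indexedTokens.contains(q)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String indexedText = "abcd-2";
        String query = "ABCD";
        // Same analysis on both sides; only the choice of analysis differs.
        System.out.println(matches(whitespaceAnalyze(indexedText), whitespaceAnalyze(query))); // false
        System.out.println(matches(wordAnalyze(indexedText), wordAnalyze(query)));             // true
    }
}
```

In real Lucene code the same roles are played by a Tokenizer and TokenFilters combined
in an Analyzer; the sketch only shows why the choice of analysis decides what matches.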

Uwe

-----
Uwe Schindler
Achterdiek 19, D-28357 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de

> -----Original Message-----
> From: Jacek Grzebyta [mailto:grzebyta.dev@gmail.com]
> Sent: Friday, June 9, 2017 1:39 PM
> To: java-user@lucene.apache.org
> Subject: Re: Penalize fact the searched term is within a world
> 
> Hi Ahmet,
> 
> That works! Still, I do not understand how that stuff works. I just know
> that the analyser cuts an indexed text into tokens. But I do not know how the
> matching is done.
> 
> Can you recommend any good book to read? I prefer something with less
> maths and more examples.
> The only one I found is the free "An Introduction to Information Retrieval", but it
> has a lot of maths I do not understand.
> 
> Best regards,
> Jacek
> 
> 
> 
> On 8 June 2017 at 19:36, Ahmet Arslan <iorixxx@yahoo.com.invalid> wrote:
> 
> > Hi,
> > You can completely ban within-a-word search by simply using
> > WhitespaceTokenizer, for example. By the way, it is all about how you
> > tokenize/analyze your text. Once you have decided, you can create two
> > versions of a single field using different analysers. This allows you to assign
> > different weights to those fields at query time.
> > Ahmet
> >
> >
> > On Thursday, June 8, 2017, 2:56:37 PM GMT+3, Jacek Grzebyta <
> > grzebyta.dev@gmail.com> wrote:
> >
> >
> > Hi,
> >
> > Apologies for repeating a question from the IRC room, but I am not sure if that is
> > alive.
> >
> > I have no idea how Lucene works, but I need to modify a part of the
> > rdf4j project which depends on it.
> >
> > I need to use Lucene to create a mapping file based on text searching, and I
> > found the following problem. Let's take a term 'abcd' which is mapped
> > to node 'abcd-2' whereas node 'abcd' exists. The issue is that Lucene
> > searches for the term, finds it in both nodes 'abcd' and 'abcd-2', and gives
> > the same score. My question is: how do I modify the scoring to penalise the
> > fact that the searched term is part of a longer word, and give a higher score if it
> > is itself a word?
> >
> > Visually it looks like this:
> >
> > node 'abcd':
> >   - name: abcd
> >
> > total score = LS /lucene score/ * 2.0 /name weight/
> >
> >
> >
> > node 'abcd-2':
> >   - name: abcd-2
> >   - alias1: abcd-h
> >   - alias2: abcd-k9
> >
> > total score = LS * 2.0 + LS * 0.5 /alias1 score/ + LS * 0.1 /alias2 score/
> >
> > I gave different weights to the properties. "Name" has the highest weight,
> > but "alias" has some small weight as well. In total, the score for a node is
> > the sum of all partial scores * weight. Finally, 'abcd-2' has a higher score
> > than 'abcd'.
> >
> > thanks,
> > Jacek
> >
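
The score arithmetic from the original question, and the effect of switching to
whitespace-only tokenization, can be reproduced with a small plain-Java sketch. It is
an illustration under loud assumptions: LS is fixed at 1.0 for every matching field
(real Lucene scores vary), the class WeightedScoreSketch and its methods are made-up
names, and the regex splits only stand in for real tokenizers.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy illustration only (hypothetical names, NOT Lucene or rdf4j classes).
class WeightedScoreSketch {

    // Assumption: every matching field gets the same constant "Lucene score".
    static final double LS = 1.0;

    // Field weights taken from the question: name 2.0, alias1 0.5, alias2 0.1.
    static Map<String, Double> weights() {
        Map<String, Double> w = new LinkedHashMap<>();
        w.put("name", 2.0);
        w.put("alias1", 0.5);
        w.put("alias2", 0.1);
        return w;
    }

    static Map<String, String> nodeAbcd() {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("name", "abcd");
        return fields;
    }

    static Map<String, String> nodeAbcd2() {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("name", "abcd-2");
        fields.put("alias1", "abcd-h");
        fields.put("alias2", "abcd-k9");
        return fields;
    }

    // A field matches only if one of its tokens equals the query term exactly.
    static boolean fieldMatches(String value, String query, boolean splitOnPunctuation) {
        String delimiter = splitOnPunctuation ? "[^\\p{L}\\p{N}]+" : "\\s+";
        for (String token : value.split(delimiter)) {
            if (token.equals(query)) {
                return true;
            }
        }
        return false;
    }

    // Node score = sum over matching fields of LS * field weight.
    static double nodeScore(Map<String, String> fields, String query, boolean splitOnPunctuation) {
        double total = 0.0;
        for (Map.Entry<String, String> field : fields.entrySet()) {
            if (fieldMatches(field.getValue(), query, splitOnPunctuation)) {
                total += LS * weights().getOrDefault(field.getKey(), 0.0);
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Splitting on '-' reproduces the problem: 'abcd-2' outscores 'abcd'.
        System.out.println(nodeScore(nodeAbcd(), "abcd", true));
        System.out.println(nodeScore(nodeAbcd2(), "abcd", true));
        // Whitespace-only splitting: 'abcd-2' no longer matches at all.
        System.out.println(nodeScore(nodeAbcd(), "abcd", false));
        System.out.println(nodeScore(nodeAbcd2(), "abcd", false));
    }
}
```

With whitespace-only splitting the node 'abcd-2' drops out entirely. Keeping two
versions of the field with different analysers, as suggested above, would instead let
the exact-token version carry a higher weight than the split version.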



