From: Nick Wellnhofer <wellnhofer@aevum.de>
Date: Tue, 21 Nov 2017 10:49:09 +0100
To: user@lucy.apache.org
Subject: Re: [lucy-user] C library - Scoring mechanism
archived-at: Tue, 21 Nov 2017 09:49:21 -0000


On Nov 21, 2017, at 02:09 , serkanmulayim@gmail.com wrote:
> I have a question regarding the scoring mechanism for relevancy. Is the scoring mechanism tf/idf when the field is indexed with the EasyAnalyzer in the schema? What happens when multiple terms are used? Are the tf/idf values summed?

Lucy uses Lucene's Practical Scoring Function by default:

https://lucene.apache.org/core/3_6_0/api/core/org/apache/lucene/search/Similarity.html

Essentially, each term's tf/idf value is multiplied by that term's boost and normalization factor, and the results are summed.
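
As a rough illustration of the summation described above, here is a minimal Python sketch modeled on the Practical Scoring Function from the Lucene 3.6 documentation linked above. It is not Lucy's code: the function and parameter names are made up for this example, the tf/idf/coord/queryNorm defaults are those of Lucene's DefaultSimilarity, and the normalization factor is reduced to a plain length norm (Lucene also folds field and document boosts into it).

```python
import math


def tf(freq):
    # Lucene's DefaultSimilarity: square root of the term frequency.
    return math.sqrt(freq)


def idf(doc_freq, num_docs):
    # Lucene's DefaultSimilarity: 1 + ln(numDocs / (docFreq + 1)).
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def score(query_terms, doc_term_freqs, doc_freqs, num_docs, doc_length,
          boosts=None):
    """Score one document against a multi-term query (illustrative only)."""
    if not query_terms:
        return 0.0
    boosts = boosts or {}

    # Query norm: makes scores of different queries roughly comparable.
    sum_sq = sum(
        (idf(doc_freqs.get(t, 0), num_docs) * boosts.get(t, 1.0)) ** 2
        for t in query_terms
    )
    query_norm = 1.0 / math.sqrt(sum_sq)

    # Simplified normalization factor: shorter fields score higher.
    length_norm = 1.0 / math.sqrt(doc_length) if doc_length > 0 else 0.0

    # Coord: fraction of the query terms that the document contains.
    matched = [t for t in query_terms if doc_term_freqs.get(t, 0) > 0]
    coord = len(matched) / len(query_terms)

    # Sum the per-term contributions: tf * idf^2 * boost * norm.
    total = sum(
        tf(doc_term_freqs[t])
        * idf(doc_freqs.get(t, 0), num_docs) ** 2
        * boosts.get(t, 1.0)
        * length_norm
        for t in matched
    )
    return coord * query_norm * total
```

With this sketch, a document that contains both query terms scores higher than one that contains only one of them, both because the sum has more terms and because coord is larger.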

> How does it incorporate the location of the words into the scoring mechanism for queries with multiple words?

> How about the fields which have a RegexTokenizer? Is it still the same mechanism? Does the type of the tokenizer affect the scoring? I believe the important thing is the generated tokens (and not the tokenizer itself), and maybe the order of the tokens in a document.

If you use the core Tokenizers, neither the type of Tokenizer nor the location of terms in a document affects scoring. But you can write a custom Tokenizer that sets a different boost value for each Token, for example depending on its location within the document.
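
To make the idea of position-dependent boosts concrete, here is a small library-neutral Python sketch. The `Token` class, the function name, and the `lead_window`/`lead_boost` parameters are all hypothetical stand-ins chosen for this example; they are not Lucy's Tokenizer or Token API. The sketch simply gives tokens near the start of a document a higher boost.

```python
import re
from dataclasses import dataclass


@dataclass
class Token:
    # Hypothetical stand-in for an analysis token carrying a boost.
    text: str
    position: int
    boost: float = 1.0


def tokenize_with_position_boost(text, lead_window=10, lead_boost=2.0):
    """Split text into word tokens, boosting the first `lead_window` tokens."""
    tokens = []
    for pos, match in enumerate(re.finditer(r"\w+", text.lower())):
        # Tokens inside the leading window get the higher boost.
        boost = lead_boost if pos < lead_window else 1.0
        tokens.append(Token(match.group(), pos, boost))
    return tokens
```

A scorer that multiplies each term's contribution by its boost would then rank documents mentioning a term early above documents mentioning it only later.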

> One more thing: if I wanted to change the scoring mechanism for different fields, how could I do it? Are there any predefined mechanisms, e.g. tf/idf, doc2vec, etc.? Or, if I want to go further and come up with my own, how can I do it?

You can tweak the scoring formula by supplying your own Similarity subclass for each FieldType, possibly in conjunction with your own Query/Compiler/Matcher subclasses:

https://lucy.apache.org/docs/c/Lucy/Index/Similarity.html

The public documentation for Similarity is incomplete, unfortunately. But the class is similar to Lucene’s. The .cfh file contains more details:

https://git1-us-west.apache.org/repos/asf?p=lucy.git;a=blob;f=core/Lucy/Index/Similarity.cfh;h=15ec409dee06b19af1b855db50b4fef229dd314e;hb=HEAD

You’d typically override methods TF, IDF, Coord, Length_Norm, or Query_Norm.
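
The override pattern can be sketched in Python as follows. This is only a model of the idea, not Lucy's C API: the method names mirror the ones listed above, but the argument names and the default formulas (taken from Lucene's DefaultSimilarity) are assumptions, and `FlatSimilarity` is a hypothetical subclass invented for this example.

```python
import math


class Similarity:
    """Illustrative base class with Lucene-style default formulas."""

    def tf(self, freq):
        return math.sqrt(freq)

    def idf(self, doc_freq, total_docs):
        return 1.0 + math.log(total_docs / (doc_freq + 1))

    def coord(self, overlap, max_overlap):
        return overlap / max_overlap if max_overlap else 0.0

    def length_norm(self, num_tokens):
        return 1.0 / math.sqrt(num_tokens) if num_tokens else 0.0

    def query_norm(self, sum_of_squared_weights):
        if sum_of_squared_weights <= 0:
            return 1.0
        return 1.0 / math.sqrt(sum_of_squared_weights)


class FlatSimilarity(Similarity):
    """Hypothetical variant: ignore term repetition and field length."""

    def tf(self, freq):
        # A term either occurs or it does not; repeats add nothing.
        return 1.0 if freq > 0 else 0.0

    def length_norm(self, num_tokens):
        # Short and long fields are treated alike.
        return 1.0
```

A variant like this might suit short fields such as titles or tags, where repeating a term or having fewer tokens should not change the ranking; the methods that are not overridden keep their default behavior.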

Nick