From: Nick Wellnhofer
To: user@lucy.apache.org
Date: Tue, 21 Nov 2017 10:49:09 +0100
Subject: Re: [lucy-user] C library - Scoring mechanism
Message-Id: <516E7E30-A0CE-4AB6-A32D-A211054BD4D6@aevum.de>

On Nov 21, 2017, at 02:09, serkanmulayim@gmail.com wrote:

> I have a question regarding the scoring mechanism for relevancy. Is the scoring mechanism tf/idf when the field indexed with the EasyAnalyzer in the schema? What happens when multiple terms are used? Are tf/idf's summed?

Lucy uses Lucene's Practical Scoring Function by default:

https://lucene.apache.org/core/3_6_0/api/core/org/apache/lucene/search/Similarity.html

Essentially, tf/idf values are summed after being multiplied with each term's boost and normalization factor.

> How does the incorporate the location of the words to the scoring mechanism for queries with multiple words?

> How about the fields which has RegexTokenizer? Is it still the same mechanism? Does the type of the tokenizer affect the scoring? I believe the important thing is the generated tokens (and not related to the tokenizer), and maybe the order of the tokens in a document.

If you use the core Tokenizers, the type of Tokenizer or the location of terms in a document don’t affect scoring. But you can write a custom Tokenizer that sets different boost values for each Token, for example depending on the location within the document.

> One more thing, if I were to change the scoring mechanism for different fields, how can I do it? Are there any predefined mechanisms eg. tf/idf doc2vec etc. Or if I want to go further and come up with my own how can I do it?

You can tweak the scoring formula by supplying your own Similarity subclass for each FieldType, possibly in conjunction with your own Query/Compiler/Matcher subclasses:

https://lucy.apache.org/docs/c/Lucy/Index/Similarity.html

The public documentation for Similarity is incomplete, unfortunately. But the class is similar to Lucene’s. The .cfh file contains more details:

https://git1-us-west.apache.org/repos/asf?p=lucy.git;a=blob;f=core/Lucy/Index/Similarity.cfh;h=15ec409dee06b19af1b855db50b4fef229dd314e;hb=HEAD

You’d typically override methods TF, IDF, Coord, Length_Norm, or Query_Norm.

Nick
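
To make the summation described in the reply concrete, here is a minimal, standalone Python sketch of the Practical Scoring Function. It does not call Lucy or Lucene. The function names are hypothetical and only mirror the Similarity methods named above (TF, IDF, Coord, Length_Norm, Query_Norm). The formulas are the defaults documented for Lucene 3.6's DefaultSimilarity; that Lucy's defaults are identical is an assumption here, and index-time field boosts and the lossy encoding of norms are left out.

```python
import math

# Hypothetical stand-ins for the Similarity methods named in the reply.
# Formulas follow Lucene 3.6 DefaultSimilarity (assumed, not verified, for Lucy).

def tf(freq):
    """Term frequency factor: square root of the raw count."""
    return math.sqrt(freq)

def idf(doc_freq, total_docs):
    """Inverse document frequency: rarer terms weigh more."""
    return 1.0 + math.log(total_docs / (doc_freq + 1))

def length_norm(num_tokens):
    """Shorter fields score higher for the same number of matches."""
    return 1.0 / math.sqrt(num_tokens) if num_tokens > 0 else 0.0

def coord(overlap, max_overlap):
    """Fraction of the query terms that the document matched."""
    return overlap / max_overlap if max_overlap > 0 else 0.0

def query_norm(sum_of_squared_weights):
    """Makes scores comparable across queries; does not change ranking."""
    if sum_of_squared_weights <= 0:
        return 0.0
    return 1.0 / math.sqrt(sum_of_squared_weights)

def score(query, term_freqs, num_tokens, doc_freqs, total_docs):
    """Score one document field against a query.

    query      -- dict mapping term -> boost
    term_freqs -- dict mapping term -> count in this document field
    num_tokens -- number of tokens in this document field
    doc_freqs  -- dict mapping term -> number of documents containing it
    total_docs -- number of documents in the index
    """
    norm = length_norm(num_tokens)
    total = 0.0
    overlap = 0
    sum_sq = 0.0
    for term, boost in query.items():
        term_idf = idf(doc_freqs.get(term, 0), total_docs)
        sum_sq += (term_idf * boost) ** 2
        freq = term_freqs.get(term, 0)
        if freq > 0:
            overlap += 1
            # Per-term contribution: tf * idf^2 * boost * norm, then summed.
            total += tf(freq) * term_idf ** 2 * boost * norm
    return coord(overlap, len(query)) * query_norm(sum_sq) * total
```

For example, `score({"lucy": 1.0}, {"lucy": 4}, 4, {"lucy": 9}, 10)` scores a four-token field containing "lucy" four times. Term positions never enter the computation, which matches the point above that location does not affect scoring with the core Tokenizers.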