From: Jacek Grzebyta
Date: Fri, 9 Jun 2017 12:38:56 +0100
Subject: Re:
Penalize fact the searched term is within a word
To: java-user@lucene.apache.org
archived-at: Fri, 09 Jun 2017 11:39:11 -0000

Hi Ahmet,

That works! Still, I do not understand how that stuff works. I just know
that an analyser cuts the indexed text into tokens, but I do not know how
the matching is done. Can you recommend a good book to read? I would prefer
something with less maths and more examples. The only one I found is the
free "An Introduction to Information Retrieval", but it has a lot of maths
I do not understand.

Best regards,
Jacek

On 8 June 2017 at 19:36, Ahmet Arslan wrote:

> Hi,
> You can completely ban within-a-word search by simply using
> WhitespaceTokenizer, for example. By the way, it is all about how you
> tokenize/analyze your text. Once you have decided, you can create two
> versions of a single field using different analysers. This allows you to
> assign different weights to those fields at query time.
> Ahmet
>
> On Thursday, June 8, 2017, 2:56:37 PM GMT+3, Jacek Grzebyta
> <grzebyta.dev@gmail.com> wrote:
>
> Hi,
>
> Apologies for repeating a question from the IRC room, but I am not sure
> if that is alive.
>
> I have no idea how lucene works, but I need to modify a part of the
> rdf4j project which depends on it.
>
> I need to use lucene to create a mapping file based on text searching,
> and I found the following problem. Let us take a term 'abcd' which is
> mapped to node 'abcd-2' whereas node 'abcd' exists. The issue is that
> lucene searches for the term, finds it in both nodes 'abcd' and 'abcd-2',
> and gives them the same score. My question is: how can I modify the
> scoring to penalise the fact that the searched term is part of a longer
> word, and give a higher score if it is a word by itself?
>
> Visually it looks like this:
>
> node 'abcd':
>   - name: abcd
>
> total score = LS /lucene score/ * 2.0 /name weight/
>
> node 'abcd-2':
>   - name: abcd-2
>   - alias1: abcd-h
>   - alias2: abcd-k9
>
> total score = LS * 2.0 + LS * 0.5 /alias1 score/ + LS * 0.1 /alias2 score/
>
> I gave different weights to the properties. "Name" has the highest weight,
> but "alias" has some small weight as well. In total, the score for a node
> is the sum of all partial scores * weights. In the end 'abcd-2' gets a
> higher score than 'abcd'.
>
> thanks,
> Jacek
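
For readers of the thread, here is a small plain-Python sketch of why the tokenizer choice changes the ranking in the 'abcd' / 'abcd-2' example. It does not use Lucene: the two tokenizer functions only imitate splitting on whitespace versus splitting on punctuation, LS is replaced by a stand-in value of 1.0 for an exact token match, and the boosts in combined_score (1.0 and 0.2) are made-up values for illustration. The field weights 2.0, 0.5 and 0.1 are the ones from the question.

```python
import re

def whitespace_tokens(text):
    """Split on whitespace only, so 'abcd-2' stays one token."""
    return text.split()

def word_tokens(text):
    """Split on any non-alphanumeric character, so 'abcd-2' gives 'abcd' and '2'."""
    return [t for t in re.split(r"[^0-9A-Za-z]+", text) if t]

# Field weights from the example in the question.
WEIGHTS = {"name": 2.0, "alias1": 0.5, "alias2": 0.1}

NODES = {
    "abcd": {"name": "abcd"},
    "abcd-2": {"name": "abcd-2", "alias1": "abcd-h", "alias2": "abcd-k9"},
}

def node_score(term, fields, tokenize):
    """Sum weight * LS over all fields of a node.

    LS is a stand-in here: 1.0 if the term equals one of the field's tokens,
    otherwise 0.0. A real Lucene score is more involved.
    """
    total = 0.0
    for field, value in fields.items():
        ls = 1.0 if term in tokenize(value) else 0.0
        total += WEIGHTS[field] * ls
    return total

def combined_score(term, fields, exact_boost=1.0, partial_boost=0.2):
    """Two versions of each field: a whitespace-tokenized one with a high boost
    and a punctuation-split one with a low boost (both boosts are hypothetical)."""
    exact = node_score(term, fields, whitespace_tokens)
    partial = node_score(term, fields, word_tokens)
    return exact_boost * exact + partial_boost * partial

if __name__ == "__main__":
    for name, fields in NODES.items():
        print(name,
              node_score("abcd", fields, word_tokens),
              node_score("abcd", fields, whitespace_tokens),
              combined_score("abcd", fields))
```

With the punctuation-splitting tokenizer, 'abcd-2' collects a match from its name and both aliases and so outranks 'abcd'. With whitespace-only tokens, 'abcd-2' no longer matches at all. Combining the two versions of the field keeps the partial matches findable but ranks the exact node first.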