From: Jacek Grzebyta
Date: Fri, 9 Jun 2017 14:19:36 +0100
Subject: Re: Penalize fact the searched term is within a world
To: java-user@lucene.apache.org

Unfortunately, WhitespaceTokenizer does not work properly for the real
data. I also tried KeywordAnalyzer, because the data I need to process
are just IDs, but with that there is no output at all.

On 9 June 2017 at 14:09, Uwe Schindler wrote:

> Hi,
>
> the tokens are matched as is. It is only a match if the tokens are
> exactly the same bytes. No substring matching is ever done, just a
> simple comparison of bytes.
>
> To get fuzzier matches, you have to do text analysis right. This
> includes splitting the text into tokens (Tokenizer), but also term
> "normalization" (TokenFilters). One example is lowercasing (to allow
> case-insensitive matching), but stemming might also be done, or
> conversion to phonetic codes (to allow phonetic matches). The tokens
> that come out do not necessarily need to be "human readable" anymore.
> How does this work with matching, given that the user won't enter
> phonetic codes? Tokenization and normalization are done on both the
> indexing side and the query side. If both sides produce the same
> tokens, it's a match; very simple. With that information you should
> be able to think about good ways to analyze the text for your use
> case. If you use Solr, the schema.xml is your friend. In Lucene, look
> at the analysis module, which has examples for common languages. If
> you want to do your own, use CustomAnalyzer to create your own
> combination of tokenization and normalization (filtering of tokens).
>
> Uwe
>
> -----
> Uwe Schindler
> Achterdiek 19, D-28357 Bremen
> http://www.thetaphi.de
> eMail: uwe@thetaphi.de
>
> > -----Original Message-----
> > From: Jacek Grzebyta [mailto:grzebyta.dev@gmail.com]
> > Sent: Friday, June 9, 2017 1:39 PM
> > To: java-user@lucene.apache.org
> > Subject: Re: Penalize fact the searched term is within a world
> >
> > Hi Ahmet,
> >
> > That works! Still, I do not understand how that stuff works. I just
> > know that the analyser cuts the indexed text into tokens, but I do
> > not know how the matching is done.
> >
> > Can you recommend any good book to read? I would prefer something
> > with less maths and more examples. The only one I found is the free
> > "An Introduction to Information Retrieval", but it has a lot of
> > maths I do not understand.
> >
> > Best regards,
> > Jacek
> >
> > On 8 June 2017 at 19:36, Ahmet Arslan wrote:
> >
> > > Hi,
> > > You can completely ban within-a-word search by simply using
> > > WhitespaceTokenizer, for example. By the way, it is all about how
> > > you tokenize/analyze your text. Once you have decided, you can
> > > create two versions of a single field using different analysers.
> > > This allows you to assign different weights to those fields at
> > > query time.
> > > Ahmet
> > >
> > > On Thursday, June 8, 2017, 2:56:37 PM GMT+3, Jacek Grzebyta
> > > <grzebyta.dev@gmail.com> wrote:
> > >
> > > Hi,
> > >
> > > Apologies for repeating a question from the IRC room, but I am
> > > not sure if that is alive.
> > >
> > > I have no idea how Lucene works, but I need to modify a part of
> > > the rdf4j project which depends on it.
> > >
> > > I need to use Lucene to create a mapping file based on text
> > > searching, and I found the following problem. Take a term 'abcd'
> > > which gets mapped to node 'abcd-2' even though node 'abcd' exists.
> > > I found the issue is that Lucene searches for the term, finds it
> > > in both nodes 'abcd' and 'abcd-2', and gives them the same score.
> > > My question is: how can I modify the scoring to penalise the fact
> > > that the searched term is part of a longer word, and give a higher
> > > score if it is a word by itself?
> > >
> > > Visually it looks like this:
> > >
> > > node 'abcd':
> > >  - name: abcd
> > >
> > > total score = LS /lucene score/ * 2.0 /name weight/
> > >
> > > node 'abcd-2':
> > >  - name: abcd-2
> > >  - alias1: abcd-h
> > >  - alias2: abcd-k9
> > >
> > > total score = LS * 2.0 + LS * 0.5 /alias1 score/ + LS * 0.1
> > > /alias2 score/
> > >
> > > I gave different weights to the properties. "Name" has the highest
> > > weight, but "alias" has some small weight as well. In total, the
> > > score for a node is the sum of all partial scores * weight. In the
> > > end 'abcd-2' has a higher score than 'abcd'.
> > >
> > > thanks,
> > > Jacek
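
For readers of the archive, the behaviour discussed in this thread can be reproduced with a small toy model. This is a sketch only, written in plain Python, and it does not use the Lucene API. It assumes that the analyzer in use splits 'abcd-2' at the hyphen (which would explain why both nodes match), that every matching field gets the same base score LS, and that the weights are the ones quoted above. The function names `word_tokens`, `whitespace_tokens`, `field_matches` and `node_score` are made up for this illustration.

```python
import re


def whitespace_tokens(text):
    """Split on whitespace only and lowercase: 'abcd-2' stays a single token."""
    return [token.lower() for token in text.split()]


def word_tokens(text):
    """Split on every non-alphanumeric character and lowercase: 'abcd-2' -> ['abcd', '2']."""
    return [token.lower() for token in re.split(r"[^0-9A-Za-z]+", text) if token]


def field_matches(query, value, analyze):
    """A field matches only if a query token equals an indexed token exactly.

    The same analysis function is applied on the index side and the query side.
    """
    indexed = set(analyze(value))
    return any(token in indexed for token in analyze(query))


def node_score(query, fields, weights, analyze, ls=1.0):
    """Sum LS * weight over all fields of a node that match the query."""
    return sum(
        ls * weights[name]
        for name, value in fields.items()
        if field_matches(query, value, analyze)
    )


# Weights and nodes as described in the original question.
WEIGHTS = {"name": 2.0, "alias1": 0.5, "alias2": 0.1}
NODE_ABCD = {"name": "abcd"}
NODE_ABCD_2 = {"name": "abcd-2", "alias1": "abcd-h", "alias2": "abcd-k9"}

if __name__ == "__main__":
    for analyze in (word_tokens, whitespace_tokens):
        print(
            analyze.__name__,
            node_score("abcd", NODE_ABCD, WEIGHTS, analyze),
            node_score("abcd", NODE_ABCD_2, WEIGHTS, analyze),
        )
```

With the hyphen-splitting tokenizer, all three fields of 'abcd-2' contain the token 'abcd', so its weighted sum (2.0 + 0.5 + 0.1 times LS) exceeds the 2.0 times LS of node 'abcd'. With whitespace-only tokenization, 'abcd-2' no longer produces the token 'abcd' and only node 'abcd' matches. As noted at the top of this message, that was not sufficient for the real data, so the sketch covers only the toy case from the question.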