From: Michael McCandless
Date: Mon, 13 Feb 2017 07:52:46 -0500
Subject: Re: SynonymFilterFactory deprecated since 6.4.0
To: Lucene Users

Thanks Bernd; I'll see if I can make a test case from this.

Mike McCandless

http://blog.mikemccandless.com

On Mon, Feb 13, 2017 at 5:00 AM, Bernd Fehling wrote:
> My very simple and small sysonym_test.txt has only one line:
> naturwald, natural\ forest, forêt\ naturelle, natürlicher\ wald
>
> If I only use WT (WhitespaceTokenizer) and SGF (with WhitespaceTokenizer)
> the result is:
>
> WT   text         start  end  positionLength  type     position
>      natural      0      7    1               word     1
>      forest       8      14   1               word     2
>
> SGF  text         start  end  positionLength  type     position
>      natural      0      7    3               word     1
>      naturelle    0      14   3               SYNONYM  2
>      wald         0      14   2               SYNONYM  3
>      naturwald    0      14   4               SYNONYM  1
>      forêt        0      14   1               SYNONYM  1
>      natürlicher  0      14   2               SYNONYM  1
>
>      forest       8      14   1               word     4
>
> The result is some kind of rubbish.
> Also note the empty line between "natürlicher" and "forest".
>
> Anything else I should try, maybe with KeywordTokenizer?
>
> p.s. You might have noticed the SPF filters in my setup.
> First is SynonymPreFilter to set all attributes to the right value,
> second is SynonymPostFilter to again fix all attribute settings but
> also set multi-word synonyms as phrase and also clean up the result
> of SGF.
>
> Regards
> Bernd
>
> On 11.02.2017 at 00:45, Michael McCandless wrote:
>> Yeah, those tokens should have position length 2.
>>
>> Can you reduce to a small set of synonyms and text? If you use only
>> whitespace tokenizer and SGF does the issue reproduce?
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>>
>> On Fri, Feb 10, 2017 at 10:07 AM, Bernd Fehling
>> wrote:
>>> Example for position end and positionLength of SGF.
>>>
>>> query: natural forest
>>>
>>> WT   text     start  end  positionLength  type  position
>>>      natural  0      7    1               word  1
>>>      forest   8      14   1               word  2
>>> ...
>>>
>>> SPF  text            start  end  positionLength  type     position
>>>      natural         0      7    1               word     1
>>>      natural forest  0      14   2               shingle  2
>>>      forest          8      14   1               word     3
>>>
>>> SGF  text              start  end  positionLength  type     position
>>>      natural           0      7    1               word     1
>>>      naturwald         0      14   1               SYNONYM  2
>>>      forêt naturelle   0      14   1               SYNONYM  2
>>>      natürlicher wald  0      14   1               SYNONYM  2
>>>      natural forest    0      14   1               shingle  2
>>>      forest            8      14   1               word     3
>>>
>>> SPF  text                start  end  positionLength  type     position
>>>      natural             0      7    1               word     1
>>>      naturwald           0      9    1               SYNONYM  2
>>>      "forêt naturelle"   0      17   2               SYNONYM  2
>>>      "natürlicher wald"  0      18   2               SYNONYM  2
>>>      "natural forest"    0      16   2               shingle  2
>>>      forest              8      14   1               word     3
>>>
>>>
>>> SGF (SynonymGraphFilter) gives all SYNONYMs the same end position and positionLength.
>>> I suppose that is not correct?
>>>
>>> Regards
>>> Bernd
>>>
>>>
>>> On 09.02.2017 at 18:39, Michael McCandless wrote:
>>>> On Thu, Feb 9, 2017 at 2:40 AM, Bernd Fehling
>>>> wrote:
>>>>> I tried SynonymGraphFilter with my setup and it works right away.
>>>>> It paid off that I did some modifications on my filters while
>>>>> testing 6.3 with my setup.
>>>>
>>>> Good!
>>>>
>>>>> I only replaced SynonymFilter with SynonymGraphFilter and did not
>>>>> use FlattenGraphFilter, pretty simple. So I can confirm that, up
>>>>> to this point, SynonymGraphFilter is a full replacement for
>>>>> SynonymFilter. At least for search-time synonym handling.
>>>>>
>>>>> But this also means there is still some work with the attributes, right?
>>>>> Position looks good, type and start are no problem anyway, but
>>>>> the end position is still wrong, as is the positionLength for
>>>>> multi-word synonyms.
>>>>
>>>> Can you give an example or make a small test case?
>>>> PositionLengthAttribute is supposed to be correct coming out of
>>>> SynonymGraphFilter.
>>>>
>>>>> One thing I noticed was that the originating token which "produces"
>>>>> synonyms comes out last from SynonymGraphFilter, after the
>>>>> "produced" synonyms.
>>>>> I will have a look inside with the debugger but I guess this is due
>>>>> to output buffering of SynonymGraphFilter?
>>>>
>>>> Yeah they do come out in a different order, which token filters are
>>>> allowed to do in general for all tokens leaving from the same position
>>>> ...
>>>>
>>>> Mike McCandless
>>>>
>>>> http://blog.mikemccandless.com
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>
>>>
>>> --
>>> *************************************************************
>>> Bernd Fehling                    Bielefeld University Library
>>> Dipl.-Inform. (FH)                LibTec - Library Technology
>>> Universitätsstr. 25                  and Knowledge Management
>>> 33615 Bielefeld
>>> Tel. +49 521 106-4060       bernd.fehling(at)uni-bielefeld.de
>>>
>>> BASE - Bielefeld Academic Search Engine - www.base-search.net
>>> *************************************************************
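
A possible aid for reading the SGF table in the Feb 13 message quoted above: in a Lucene token graph, positionLength is the number of positions a token spans, so each row can be read as an arc from `position` to `position + positionLength`. The Python sketch below is not Lucene code. `Token`, `arcs`, and `paths` are hypothetical helpers invented for illustration, the numbers are copied from that table, and offsets (start/end) are ignored. It only lists the token sequences that lead from position 1 to position 5.

```python
from collections import namedtuple

# Hypothetical record mirroring three columns of the quoted table; not a Lucene class.
Token = namedtuple("Token", "text position position_length")


def arcs(tokens):
    """Read each token as an arc: position -> position + positionLength."""
    return [(t.text, t.position, t.position + t.position_length) for t in tokens]


def paths(tokens, start, end):
    """Return every token sequence that leads from node `start` to node `end`."""
    by_start = {}
    for text, src, dst in arcs(tokens):
        by_start.setdefault(src, []).append((text, dst))

    found = []

    def walk(node, acc):
        if node == end:
            found.append(" ".join(acc))
            return
        # Arcs always move forward (positionLength >= 1), so this terminates.
        for text, dst in by_start.get(node, []):
            walk(dst, acc + [text])

    walk(start, [])
    return sorted(found)


# Values as posted in the SGF table (text, position, positionLength).
SGF_TOKENS = [
    Token("natural", 1, 3),
    Token("naturelle", 2, 3),
    Token("wald", 3, 2),
    Token("naturwald", 1, 4),
    Token("forêt", 1, 1),
    Token("natürlicher", 1, 2),
    Token("forest", 4, 1),
]

if __name__ == "__main__":
    for p in paths(SGF_TOKENS, 1, 5):
        print(p)
```

Read this way, the posted positions and position lengths spell out the four alternatives from the synonym line (forêt naturelle, natural forest, naturwald, natürlicher wald). That is only a reading of the numbers as posted; it says nothing about the end offsets or about how the downstream filters in Bernd's setup treat the graph.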