From: Michael McCandless
Date: Mon, 13 Feb 2017 11:31:50 -0500
Subject: Re: SynonymFilterFactory deprecated since 6.4.0
To: Lucene Users, Bernd Fehling
archived-at: Mon, 13 Feb 2017 16:32:19 -0000

On Mon, Feb 13, 2017 at 9:04 AM, Bernd Fehling wrote:

> Am I confused by the naming of pos, positionIncrement, offset, positionLength,
> start and end between Lucene and Solr?

"pos" is just accumulating the positionIncrement values, starting from
-1. I don't think Solr's analysis UI would change the meaning of
these attributes.

> OK, the SynonymGraphFilter is ONLY for Lucene, right?

No, it's also for Solr and Elasticsearch and any other search servers
on top of Lucene as well.

> But how are you going to build the multi-word synonym query "natürlicher wald"
> from "natural forest"?

Lucene's and Elasticsearch's query parsers have already been fixed to
correctly handle token graphs by default; Solr has a fork of Lucene's
query parser I think ... I'm not sure if it's been fixed yet to
interpret graphs. See e.g.
https://issues.apache.org/jira/browse/LUCENE-7603 and
https://issues.apache.org/jira/browse/LUCENE-7638

> And how are you going to highlight a synonym hit for "natürlicher wald"
> when start and end is set to 0-14 and not to 0-18?
> Or is start and end not used for highlighting?

This start/end offset, at query time, is not normally used. If you
have a document in the index that has "natürlicher wald" then it would
have offsets X to X+18, stored in the index ideally as postings
offsets, and should highlight correctly?

Mike McCandless

http://blog.mikemccandless.com

> On 13.02.2017 at 14:24, Michael McCandless wrote:
>> Unfortunately, I cannot reproduce the problem with a straight Lucene
>> test case. I added this test case to TestSynonymGraphFilter.java:
>>
>> https://gist.github.com/mikemccand/318459ca507742052688e2fe800a10dd
>>
>> And when I run it, it produces the correct token graph:
>>
>> TOKEN: naturwald
>>   offset: 0-14
>>   pos: 0-4
>>   type: SYNONYM
>>
>> TOKEN: forêt
>>   offset: 0-14
>>   pos: 0-1
>>   type: SYNONYM
>>
>> TOKEN: natürlicher
>>   offset: 0-14
>>   pos: 0-2
>>   type: SYNONYM
>>
>> TOKEN: natural
>>   offset: 0-7
>>   pos: 0-3
>>   type: word
>>
>> TOKEN: naturelle
>>   offset: 0-14
>>   pos: 1-4
>>   type: SYNONYM
>>
>> TOKEN: wald
>>   offset: 0-14
>>   pos: 2-4
>>   type: SYNONYM
>>
>> TOKEN: forest
>>   offset: 8-14
>>   pos: 3-4
>>   type: word
>>
>> Remember that the "pos: " output above is really "node IDs" and you
>> can see the inserted side paths are correct. The offsets are
>> necessarily always 0-14 for inserted tokens because that is the span
>> of the two original tokens.
>>
>> Can you try removing the SPF filters in your test? Or otherwise
>> simplify your test so it's closer to what my test case is doing?
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>> On Mon, Feb 13, 2017 at 7:52 AM, Michael McCandless
>> wrote:
>>> Thanks Bernd; I'll see if I can make a test case from this.
>>>
>>> Mike McCandless
>>>
>>> http://blog.mikemccandless.com
>>>
>>>
>>> On Mon, Feb 13, 2017 at 5:00 AM, Bernd Fehling
>>> wrote:
>>>> My very simple and small sysonym_test.txt has only one line:
>>>> naturwald, natural\ forest, forêt\ naturelle, natürlicher\ wald
>>>>
>>>> If I only use WT (WhitespaceTokenizer) and SGF (with WhitespaceTokenizer)
>>>> the result is:
>>>>
>>>> WT   text         start  end  positionLength  type     position
>>>>      natural      0      7    1               word     1
>>>>      forest       8      14   1               word     2
>>>>
>>>> SGF  text         start  end  positionLength  type     position
>>>>      natural      0      7    3               word     1
>>>>      naturelle    0      14   3               SYNONYM  2
>>>>      wald         0      14   2               SYNONYM  3
>>>>      naturwald    0      14   4               SYNONYM  1
>>>>      forêt        0      14   1               SYNONYM  1
>>>>      natürlicher  0      14   2               SYNONYM  1
>>>>
>>>>      forest       8      14   1               word     4
>>>>
>>>> The result is some kind of rubbish.
>>>> Also note the empty line between "natürlicher" and "forest".
>>>>
>>>> Anything else I should try, maybe with KeywordTokenizer?
>>>>
>>>> p.s. You might have noticed the SPF filters in my setup.
>>>>      First is SynonymPreFilter to set all attributes to the right value,
>>>>      second is SynonymPostFilter to again fix all attribute settings but
>>>>      also set multi-word synonyms as phrase and also cleanup the result
>>>>      of SGF.
>>>>
>>>> Regards
>>>> Bernd
>>>>
>>>> On 11.02.2017 at 00:45, Michael McCandless wrote:
>>>>> Yeah, those tokens should have position length 2.
>>>>>
>>>>> Can you reduce to a small set of synonyms and text? If you use only
>>>>> whitespace tokenizer and SGF does the issue reproduce?
>>>>>
>>>>> Mike McCandless
>>>>>
>>>>> http://blog.mikemccandless.com
>>>>>
>>>>>
>>>>> On Fri, Feb 10, 2017 at 10:07 AM, Bernd Fehling
>>>>> wrote:
>>>>>> Example for position end and positionLength of SGF.
>>>>>>
>>>>>> query: natural forest
>>>>>>
>>>>>> WT   text                 start  end  positionLength  type     position
>>>>>>      natural              0      7    1               word     1
>>>>>>      forest               8      14   1               word     2
>>>>>> ...
>>>>>>
>>>>>> SPF  text                 start  end  positionLength  type     position
>>>>>>      natural              0      7    1               word     1
>>>>>>      natural forest       0      14   2               shingle  2
>>>>>>      forest               8      14   1               word     3
>>>>>>
>>>>>> SGF  text                 start  end  positionLength  type     position
>>>>>>      natural              0      7    1               word     1
>>>>>>      naturwald            0      14   1               SYNONYM  2
>>>>>>      forêt naturelle      0      14   1               SYNONYM  2
>>>>>>      natürlicher wald     0      14   1               SYNONYM  2
>>>>>>      natural forest       0      14   1               shingle  2
>>>>>>      forest               8      14   1               word     3
>>>>>>
>>>>>> SPF  text                 start  end  positionLength  type     position
>>>>>>      natural              0      7    1               word     1
>>>>>>      naturwald            0      9    1               SYNONYM  2
>>>>>>      "forêt naturelle"    0      17   2               SYNONYM  2
>>>>>>      "natürlicher wald"   0      18   2               SYNONYM  2
>>>>>>      "natural forest"     0      16   2               shingle  2
>>>>>>      forest               8      14   1               word     3
>>>>>>
>>>>>>
>>>>>> SGF (SynonymGraphFilter) has for all SYNONYMs the same position end and positionLength.
>>>>>> I suppose that it is not correct?
>>>>>>
>>>>>> Regards
>>>>>> Bernd
>>>>>>
>>>>>>
>>>>>> On 09.02.2017 at 18:39, Michael McCandless wrote:
>>>>>>> On Thu, Feb 9, 2017 at 2:40 AM, Bernd Fehling
>>>>>>> wrote:
>>>>>>>> I tried SynonymGraphFilter with my setup and it works right away.
>>>>>>>> It paid off that I did some modifications on my filters while
>>>>>>>> testing 6.3 with my setup.
>>>>>>>
>>>>>>> Good!
>>>>>>>
>>>>>>>> I only replaced SynonymFilter with SynonymGraphFilter and did not
>>>>>>>> use FlattenGraphFilter, pretty simple. So I can confirm that, up
>>>>>>>> to this point, SynonymGraphFilter is a full replacement for
>>>>>>>> SynonymFilter. At least for search-time synonym handling.
>>>>>>>>
>>>>>>>> But this also means there is still some work with the attributes, right?
>>>>>>>> Position looks good, type and start are no problem anyway, but
>>>>>>>> the end position is still wrong and the positionLength for multi-word
>>>>>>>> synonyms.
>>>>>>>
>>>>>>> Can you give an example or make a small test case?
>>>>>>> PositionLengthAttribute is supposed to be correct coming out of
>>>>>>> SynonymGraphFilter.
>>>>>>>
>>>>>>>> One thing I noticed was that the originating token which "produces"
>>>>>>>> synonyms comes out last from SynonymGraphFilter, after the
>>>>>>>> "produced" synonyms.
>>>>>>>> I will have a look inside with debugger but I guess this is due
>>>>>>>> to output buffering of SynonymGraphFilter?
>>>>>>>
>>>>>>> Yeah they do come out in a different order, which token filters are
>>>>>>> allowed to do in general for all tokens leaving from the same position
>>>>>>> ...
>>>>>>>
>>>>>>> Mike McCandless
>>>>>>>
>>>>>>> http://blog.mikemccandless.com
>>>>>>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
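
A minimal sketch of the "pos" bookkeeping described in the thread: "pos" accumulates positionIncrement starting from -1, and each token then spans from node pos to node pos + positionLength. This is plain Java with no Lucene dependency. Tok, SAMPLE and describe are hypothetical names, not Lucene API, and the positionIncrement values are an assumption: the thread never prints them, so they are derived here from the "pos: from-to" lines of the test output and the positionLength column of the SGF table.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TokenGraphSketch {

    // Hypothetical stand-in for a token's position attributes; not a Lucene class.
    public static final class Tok {
        final String text;
        final int positionIncrement;
        final int positionLength;

        Tok(String text, int positionIncrement, int positionLength) {
            this.text = text;
            this.positionIncrement = positionIncrement;
            this.positionLength = positionLength;
        }
    }

    // Assumed attribute values, derived from the "pos: from-to" output in the thread,
    // listed in the order the test case printed the tokens.
    public static final List<Tok> SAMPLE = Arrays.asList(
            new Tok("naturwald", 1, 4),
            new Tok("forêt", 0, 1),
            new Tok("natürlicher", 0, 2),
            new Tok("natural", 0, 3),
            new Tok("naturelle", 1, 3),
            new Tok("wald", 1, 2),
            new Tok("forest", 1, 1));

    // Returns one "text from-to" line per token, where from/to are graph node IDs.
    public static List<String> describe(List<Tok> tokens) {
        List<String> out = new ArrayList<>();
        int pos = -1; // accumulation starts from -1, so the first token lands on node 0
        for (Tok t : tokens) {
            pos += t.positionIncrement;        // node the token leaves from
            int to = pos + t.positionLength;   // node the token arrives at
            out.add(t.text + " " + pos + "-" + to);
        }
        return out;
    }

    public static void main(String[] args) {
        for (String line : describe(SAMPLE)) {
            System.out.println(line);
        }
    }
}
```

Read this way, every path through the graph runs from node 0 to node 4: "naturwald" crosses it in one step, while "forêt naturelle", "natürlicher wald" and "natural forest" each take two.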