From: Michael McCandless <lucene@mikemccandless.com>
Date: Mon, 13 Feb 2017 11:31:50 -0500
Message-ID: <CAL8PwkY6t9iMeft=ixYXSfwzUEXu_gQhti3ha_G8DQ0PhUm1Eg@mail.gmail.com>
Subject: Re: SynonymFilterFactory deprecated since 6.4.0
To: Lucene Users <java-user@lucene.apache.org>,
	Bernd Fehling <bernd.fehling@uni-bielefeld.de>
archived-at: Mon, 13 Feb 2017 16:32:19 -0000

On Mon, Feb 13, 2017 at 9:04 AM, Bernd Fehling
<bernd.fehling@uni-bielefeld.de> wrote:

> Am I confused by the naming of pos, positionIncrement, offset, positionLength,
> start and end between Lucene and Solr?

"pos" is just accumulating the positionIncrement values, starting from
-1.  I don't think Solr's analysis UI would change the meaning of
these attributes.
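
As a rough illustration of that accumulation (plain Java, no Lucene classes; the class name `PositionDemo` and the increment/length arrays are assumptions, with the values inferred from the "pos: from-to" lines of the test output quoted further down):

```java
import java.util.ArrayList;
import java.util.List;

class PositionDemo {
    /** Accumulates position increments into absolute positions, starting from -1. */
    static int[] positions(int[] positionIncrements) {
        int[] pos = new int[positionIncrements.length];
        int current = -1;
        for (int i = 0; i < positionIncrements.length; i++) {
            current += positionIncrements[i];
            pos[i] = current;
        }
        return pos;
    }

    /** Formats each token as "term: from-to", where to = from + positionLength. */
    static List<String> nodeSpans(String[] terms, int[] increments, int[] lengths) {
        int[] pos = positions(increments);
        List<String> out = new ArrayList<>();
        for (int i = 0; i < terms.length; i++) {
            out.add(terms[i] + ": " + pos[i] + "-" + (pos[i] + lengths[i]));
        }
        return out;
    }

    public static void main(String[] args) {
        // Non-ASCII letters are written as Unicode escapes to avoid source-encoding issues.
        String[] terms = {"naturwald", "for\u00EAt", "nat\u00FCrlicher", "natural",
                          "naturelle", "wald", "forest"};
        int[] increments = {1, 0, 0, 0, 1, 1, 1}; // assumed, inferred from the quoted output
        int[] lengths = {4, 1, 2, 3, 3, 2, 1};    // assumed, inferred from the quoted output
        for (String line : nodeSpans(terms, increments, lengths)) {
            System.out.println(line);
        }
    }
}
```

With these inputs the first token lands on position 0 (-1 + 1), and the from/to pairs come out as 0-4, 0-1, 0-2, 0-3, 1-4, 2-4 and 3-4. An analysis UI that shows 1-based positions would display the same tokens at 1, 1, 1, 1, 2, 3 and 4.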

> OK, the SynonymGraphFilter is ONLY for Lucene, right?

No, it's also for Solr and Elasticsearch and any other search servers
on top of Lucene as well.

> But how are you going to build the multi-word synonym query "natürlicher wald"
> from "natural forest"?

Lucene's and Elasticsearch's query parsers have already been fixed to
correctly handle token graphs by default; Solr has a fork of Lucene's
query parser I think ... I'm not sure if it's been fixed yet to
interpret graphs.

See e.g. https://issues.apache.org/jira/browse/LUCENE-7603 and
https://issues.apache.org/jira/browse/LUCENE-7638
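
To make the idea concrete, here is a minimal sketch in plain Java. It is not Lucene's actual implementation; the names `GraphPathsDemo`, `Edge` and `exampleGraph` are hypothetical. It treats each token as an edge between two node IDs and enumerates every path through the graph. A graph-aware query parser can then treat each path as one alternative, so "natürlicher wald" shows up as its own complete phrase next to "natural forest":

```java
import java.util.ArrayList;
import java.util.List;

class GraphPathsDemo {
    /** One token of the graph: a term leading from node `from` to node `to`. */
    static final class Edge {
        final String term;
        final int from;
        final int to;

        Edge(String term, int from, int to) {
            this.term = term;
            this.from = from;
            this.to = to;
        }
    }

    /** Returns every path from `start` to `end` as a space-joined string of terms. */
    static List<String> paths(List<Edge> edges, int start, int end) {
        List<String> out = new ArrayList<>();
        walk(edges, start, end, "", out);
        return out;
    }

    // Depth-first walk; terminates because every edge leads to a higher node ID.
    private static void walk(List<Edge> edges, int node, int end, String prefix, List<String> out) {
        if (node == end) {
            out.add(prefix);
            return;
        }
        for (Edge e : edges) {
            if (e.from == node) {
                String next = prefix.isEmpty() ? e.term : prefix + " " + e.term;
                walk(edges, e.to, end, next, out);
            }
        }
    }

    /** The token graph from the test output quoted below (non-ASCII as Unicode escapes). */
    static List<Edge> exampleGraph() {
        List<Edge> edges = new ArrayList<>();
        edges.add(new Edge("naturwald", 0, 4));
        edges.add(new Edge("for\u00EAt", 0, 1));
        edges.add(new Edge("nat\u00FCrlicher", 0, 2));
        edges.add(new Edge("natural", 0, 3));
        edges.add(new Edge("naturelle", 1, 4));
        edges.add(new Edge("wald", 2, 4));
        edges.add(new Edge("forest", 3, 4));
        return edges;
    }

    public static void main(String[] args) {
        for (String path : paths(exampleGraph(), 0, 4)) {
            System.out.println(path);
        }
    }
}
```

For this graph the paths are "naturwald", "forêt naturelle", "natürlicher wald" and "natural forest". How each path is turned into a query (phrase query, boolean combination, and so on) is up to the parser and is left out of the sketch.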

> And how are you going to highlight a synonym hit for "natürlicher wald"
> when start and end are set to 0-14 and not to 0-18?
> Or are start and end not used for highlighting?

This start/end offset, at query time, is not normally used.  If you
have a document in the index that has "natürlicher wald" then it would
have offsets X to X+18, stored in the index ideally as postings
offsets, and should highlight correctly?
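
A small sketch of why index-time offsets are what matter for highlighting (plain Java, hypothetical names, no Lucene highlighter involved): the offsets are character positions in the stored document text, so they depend on the document, not on the query string. The sample sentence is made up, and `indexOf` merely stands in for offsets that the analyzer would record at index time. Note that in this sketch the unquoted phrase spans 16 chars; the 18 in the tables further down appears to include the two surrounding quote characters.

```java
class OffsetHighlightDemo {
    /** Wraps text[start, end) in <b>...</b>, using character offsets into the document. */
    static String highlight(String text, int start, int end) {
        return text.substring(0, start)
            + "<b>" + text.substring(start, end) + "</b>"
            + text.substring(end);
    }

    public static void main(String[] args) {
        // Hypothetical document text; non-ASCII written as a Unicode escape.
        String doc = "ein nat\u00FCrlicher wald hier";
        String phrase = "nat\u00FCrlicher wald";

        // Stand-in for the offsets an analyzer would store in the postings.
        int start = doc.indexOf(phrase);
        int end = start + phrase.length();

        System.out.println(start + "-" + end);
        System.out.println(highlight(doc, start, end));
    }
}
```

Here the phrase starts at offset 4 and ends at 20, and the highlighted text is "ein <b>natürlicher wald</b> hier", regardless of what offsets the query-side tokens for "natural forest" carried.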

Mike McCandless

http://blog.mikemccandless.com

> On 13.02.2017 at 14:24, Michael McCandless wrote:
>> Unfortunately, I cannot reproduce the problem with a straight Lucene
>> test case.  I added this test case to TestSynonymGraphFilter.java:
>>
>>     https://gist.github.com/mikemccand/318459ca507742052688e2fe800a10dd
>>
>> And when I run it, it produces the correct token graph:
>>
>> TOKEN: naturwald
>>   offset: 0-14
>>   pos: 0-4
>>   type: SYNONYM
>>
>> TOKEN: forêt
>>   offset: 0-14
>>   pos: 0-1
>>   type: SYNONYM
>>
>> TOKEN: natürlicher
>>   offset: 0-14
>>   pos: 0-2
>>   type: SYNONYM
>>
>> TOKEN: natural
>>   offset: 0-7
>>   pos: 0-3
>>   type: word
>>
>> TOKEN: naturelle
>>   offset: 0-14
>>   pos: 1-4
>>   type: SYNONYM
>>
>> TOKEN: wald
>>   offset: 0-14
>>   pos: 2-4
>>   type: SYNONYM
>>
>> TOKEN: forest
>>   offset: 8-14
>>   pos: 3-4
>>   type: word
>>
>> Remember that the "pos: " output above is really "node IDs" and you
>> can see the inserted side paths are correct.  The offsets are
>> necessarily always 0-14 for inserted tokens because that is the span
>> of the two original tokens.
>>
>> Can you try removing the SPF filters in your test?  Or otherwise
>> simplify your test so it's closer to what my test case is doing?
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>> On Mon, Feb 13, 2017 at 7:52 AM, Michael McCandless
>> <lucene@mikemccandless.com> wrote:
>>> Thanks Bernd; I'll see if I can make a test case from this.
>>>
>>> Mike McCandless
>>>
>>> http://blog.mikemccandless.com
>>>
>>>
>>> On Mon, Feb 13, 2017 at 5:00 AM, Bernd Fehling
>>> <bernd.fehling@uni-bielefeld.de> wrote:
>>>> My very simple and small sysonym_test.txt has only one line:
>>>> naturwald, natural\ forest, forêt\ naturelle, natürlicher\ wald
>>>>
>>>> If I only use WT (WhitespaceTokenizer) and SGF (with WhitespaceTokenizer)
>>>> the result is:
>>>>
>>>> WT      text     start  end  positionLength  type  position
>>>>      natural     0      7    1               word  1
>>>>       forest     8      14   1               word  2
>>>>
>>>> SGF     text     start  end  positionLength  type     position
>>>>      natural     0      7    3               word     1
>>>>    naturelle     0      14   3               SYNONYM  2
>>>>         wald     0      14   2               SYNONYM  3
>>>>    naturwald     0      14   4               SYNONYM  1
>>>>        forêt     0      14   1               SYNONYM  1
>>>>  natürlicher     0      14   2               SYNONYM  1
>>>>
>>>>       forest     8      14   1               word     4
>>>>
>>>> The result is some kind of rubbish.
>>>> Also note the empty line between "natürlicher" and "forest".
>>>>
>>>> Anything else I should try, maybe with KeywordTokenizer?
>>>>
>>>> p.s. You might have noticed the SPF filters in my setup.
>>>>      First is SynonymPreFilter to set all attributes to the right value,
>>>>      second is SynonymPostFilter to again fix all attribute settings but
>>>>      also set multi-word synonyms as phrase and also clean up the result
>>>>      of SGF.
>>>>
>>>> Regards
>>>> Bernd
>>>>
>>>> On 11.02.2017 at 00:45, Michael McCandless wrote:
>>>>> Yeah, those tokens should have position length 2.
>>>>>
>>>>> Can you reduce to a small set of synonyms and text?  If you use only
>>>>> whitespace tokenizer and SGF does the issue reproduce?
>>>>>
>>>>> Mike McCandless
>>>>>
>>>>> http://blog.mikemccandless.com
>>>>>
>>>>>
>>>>> On Fri, Feb 10, 2017 at 10:07 AM, Bernd Fehling
>>>>> <bernd.fehling@uni-bielefeld.de> wrote:
>>>>>> Example for position end and positionLength of SGF.
>>>>>>
>>>>>> query: natural forest
>>>>>>
>>>>>> WT      text     start  end  positionLength  type  position
>>>>>>         natural  0      7    1               word  1
>>>>>>         forest   8      14   1               word  2
>>>>>> ...
>>>>>>
>>>>>> SPF     text     start  end  positionLength  type     position
>>>>>>         natural  0      7    1               word     1
>>>>>>  natural forest  0      14   2               shingle  2
>>>>>>         forest   8      14   1               word     3
>>>>>>
>>>>>> SGF     text     start  end  positionLength  type     position
>>>>>>         natural  0      7    1               word     1
>>>>>>       naturwald  0      14   1               SYNONYM  2
>>>>>> forêt naturelle  0      14   1               SYNONYM  2
>>>>>> natürlicher wald 0      14   1               SYNONYM  2
>>>>>>  natural forest  0      14   1               shingle  2
>>>>>>          forest  8      14   1               word     3
>>>>>>
>>>>>> SPF     text     start  end  positionLength  type     position
>>>>>>         natural  0      7    1               word     1
>>>>>>       naturwald  0      9    1               SYNONYM  2
>>>>>> "forêt naturelle"  0    17   2               SYNONYM  2
>>>>>> "natürlicher wald" 0    18   2               SYNONYM  2
>>>>>> "natural forest" 0      16   2               shingle  2
>>>>>>          forest  8      14   1               word     3
>>>>>>
>>>>>>
>>>>>> SGF (SynonymGraphFilter) has the same end position and positionLength
>>>>>> for all SYNONYM tokens.
>>>>>> I suppose that this is not correct?
>>>>>>
>>>>>> Regards
>>>>>> Bernd
>>>>>>
>>>>>>
>>>>>> On 09.02.2017 at 18:39, Michael McCandless wrote:
>>>>>>> On Thu, Feb 9, 2017 at 2:40 AM, Bernd Fehling
>>>>>>> <bernd.fehling@uni-bielefeld.de> wrote:
>>>>>>>> I tried SynonymGraphFilter with my setup and it works right away.
>>>>>>>> It paid off that I did some modifications on my filters while
>>>>>>>> testing 6.3 with my setup.
>>>>>>>
>>>>>>> Good!
>>>>>>>
>>>>>>>> I only replaced SynonymFilter with SynonymGraphFilter and did not
>>>>>>>> use FlattenGraphFilter, pretty simple. So I can confirm that, up
>>>>>>>> to this point, SynonymGraphFilter is a full replacement for
>>>>>>>> SynonymFilter. At least for search-time synonym handling.
>>>>>>>>
>>>>>>>> But this also means there is still some work with the attributes, right?
>>>>>>>> Position looks good, type and start are no problem anyway, but
>>>>>>>> the end position is still wrong, as is the positionLength for
>>>>>>>> multi-word synonyms.
>>>>>>>
>>>>>>> Can you give an example or make a small test case?
>>>>>>> PositionLengthAttribute is supposed to be correct coming out of
>>>>>>> SynonymGraphFilter.
>>>>>>>
>>>>>>>> One thing I noticed was that the originating token which "produces"
>>>>>>>> synonyms comes out last from SynonymGraphFilter, after the
>>>>>>>> "produced" synonyms.
>>>>>>>> I will have a look inside with debugger but I guess this is due
>>>>>>>> to output buffering of SynonymGraphFilter?
>>>>>>>
>>>>>>> Yeah they do come out in a different order, which token filters are
>>>>>>> allowed to do in general for all tokens leaving from the same position
>>>>>>> ...
>>>>>>>
>>>>>>> Mike McCandless
>>>>>>>
>>>>>>> http://blog.mikemccandless.com
>>>>>>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
