lucene-java-user mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: SynonymFilterFactory deprecated since 6.4.0
Date Mon, 13 Feb 2017 13:24:04 GMT
Unfortunately, I cannot reproduce the problem with a straight Lucene
test case.  I added this test case to TestSynonymGraphFilter.java:

    https://gist.github.com/mikemccand/318459ca507742052688e2fe800a10dd

And when I run it, it produces the correct token graph:

TOKEN: naturwald
  offset: 0-14
  pos: 0-4
  type: SYNONYM

TOKEN: forêt
  offset: 0-14
  pos: 0-1
  type: SYNONYM

TOKEN: natürlicher
  offset: 0-14
  pos: 0-2
  type: SYNONYM

TOKEN: natural
  offset: 0-7
  pos: 0-3
  type: word

TOKEN: naturelle
  offset: 0-14
  pos: 1-4
  type: SYNONYM

TOKEN: wald
  offset: 0-14
  pos: 2-4
  type: SYNONYM

TOKEN: forest
  offset: 8-14
  pos: 3-4
  type: word

Remember that the "pos: " output above is really "node IDs" and you
can see the inserted side paths are correct.  The offsets are
necessarily always 0-14 for inserted tokens because that is the span
of the two original tokens.
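
As a rough illustration of how those node IDs relate to the token
attributes, here is a minimal plain-Java sketch with no Lucene dependency.
The names TokenGraphNodes, Arc, fromIncrements and fromAbsolute are made up
for this example, and the position increments in main are inferred from the
output above, not taken from the test case. It assumes the usual convention
that the position starts at -1, each token advances it by its position
increment, and the token's arc ends positionLength nodes further on.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: turns per-token position
// attributes into the "from-to" node pairs of a token graph.
class TokenGraphNodes {

    /** One arc of the graph: a term leaving node `from` and arriving at node `to`. */
    static final class Arc {
        final String term;
        final int from;
        final int to;

        Arc(String term, int from, int to) {
            this.term = term;
            this.from = from;
            this.to = to;
        }

        @Override
        public String toString() {
            return term + " " + from + "-" + to;
        }
    }

    /** Tokens in stream order, each with a positionIncrement and a positionLength. */
    static List<Arc> fromIncrements(String[] terms, int[] posInc, int[] posLen) {
        List<Arc> arcs = new ArrayList<>();
        int node = -1; // assumed start; a first increment of 1 lands on node 0
        for (int i = 0; i < terms.length; i++) {
            node += posInc[i];                                   // node the token leaves from
            arcs.add(new Arc(terms[i], node, node + posLen[i])); // node it arrives at
        }
        return arcs;
    }

    /** A single table row with a 1-based absolute position and a positionLength. */
    static Arc fromAbsolute(String term, int position, int posLen) {
        int from = position - 1;
        return new Arc(term, from, from + posLen);
    }

    public static void main(String[] args) {
        // Non-ASCII letters are written as Unicode escapes to avoid source-encoding issues.
        String[] terms = {"naturwald", "for\u00eat", "nat\u00fcrlicher", "natural",
                          "naturelle", "wald", "forest"};
        int[] posInc = {1, 0, 0, 0, 1, 1, 1}; // inferred from the "pos:" lines, an assumption
        int[] posLen = {4, 1, 2, 3, 3, 2, 1};
        for (Arc arc : fromIncrements(terms, posInc, posLen)) {
            System.out.println(arc);
        }
    }
}
```

With these inputs, main prints the same seven from-to pairs as the "pos:"
lines above. fromAbsolute does the same arithmetic for a table that lists a
1-based position column: position 2 with positionLength 3 gives 1-4.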

Can you try removing the SPF filters in your test?  Or otherwise
simplify your test so it's closer to what my test case is doing?

Mike McCandless

http://blog.mikemccandless.com

On Mon, Feb 13, 2017 at 7:52 AM, Michael McCandless
<lucene@mikemccandless.com> wrote:
> Thanks Bernd; I'll see if I can make a test case from this.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Mon, Feb 13, 2017 at 5:00 AM, Bernd Fehling
> <bernd.fehling@uni-bielefeld.de> wrote:
>> My very simple and small sysonym_test.txt has only one line:
>> naturwald, natural\ forest, forêt\ naturelle, natürlicher\ wald
>>
>> If I only use WT (WhitespaceTokenizer) and SGF (with WhitespaceTokenizer)
>> the result is:
>>
>> WT      text     start  end  positionLength  type  position
>>      natural     0      7    1               word  1
>>       forest     8      14   1               word  2
>>
>> SGF     text     start  end  positionLength  type     position
>>      natural     0      7    3               word     1
>>    naturelle     0      14   3               SYNONYM  2
>>         wald     0      14   2               SYNONYM  3
>>    naturwald     0      14   4               SYNONYM  1
>>        forêt     0      14   1               SYNONYM  1
>>  natürlicher     0      14   2               SYNONYM  1
>>
>>       forest     8      14   1               word     4
>>
>> The result is some kind of rubbish.
>> Also note the empty line between "natürlicher" and "forest".
>>
>> Anything else I should try, maybe with KeywordTokenizer?
>>
>> p.s. You might have noticed the SPF filters in my setup.
>>      The first is SynonymPreFilter, which sets all attributes to the right values;
>>      the second is SynonymPostFilter, which fixes the attribute settings again,
>>      turns multi-word synonyms into phrases, and cleans up the result
>>      of SGF.
>>
>> Regards
>> Bernd
>>
>> On 11.02.2017 at 00:45, Michael McCandless wrote:
>>> Yeah, those tokens should have position length 2.
>>>
>>> Can you reduce to a small set of synonyms and text?  If you use only
>>> whitespace tokenizer and SGF does the issue reproduce?
>>>
>>> Mike McCandless
>>>
>>> http://blog.mikemccandless.com
>>>
>>>
>>> On Fri, Feb 10, 2017 at 10:07 AM, Bernd Fehling
>>> <bernd.fehling@uni-bielefeld.de> wrote:
>>>> Example for position end and positionLength of SGF.
>>>>
>>>> query: natural forest
>>>>
>>>> WT      text     start  end  positionLength  type  position
>>>>         natural  0      7    1               word  1
>>>>         forest   8      14   1               word  2
>>>> ...
>>>>
>>>> SPF     text     start  end  positionLength  type     position
>>>>         natural  0      7    1               word     1
>>>>  natural forest  0      14   2               shingle  2
>>>>         forest   8      14   1               word     3
>>>>
>>>> SGF     text              start  end  positionLength  type     position
>>>>         natural           0      7    1               word     1
>>>>         naturwald         0      14   1               SYNONYM  2
>>>>         forêt naturelle   0      14   1               SYNONYM  2
>>>>         natürlicher wald  0      14   1               SYNONYM  2
>>>>         natural forest    0      14   1               shingle  2
>>>>         forest            8      14   1               word     3
>>>>
>>>> SPF     text                start  end  positionLength  type     position
>>>>         natural             0      7    1               word     1
>>>>         naturwald           0      9    1               SYNONYM  2
>>>>         "forêt naturelle"   0      17   2               SYNONYM  2
>>>>         "natürlicher wald"  0      18   2               SYNONYM  2
>>>>         "natural forest"    0      16   2               shingle  2
>>>>         forest              8      14   1               word     3
>>>>
>>>>
>>>> SGF (SynonymGraphFilter) has the same end position and positionLength
>>>> for all SYNONYM tokens.
>>>> I suppose that is not correct?
>>>>
>>>> Regards
>>>> Bernd
>>>>
>>>>
>>>> On 09.02.2017 at 18:39, Michael McCandless wrote:
>>>>> On Thu, Feb 9, 2017 at 2:40 AM, Bernd Fehling
>>>>> <bernd.fehling@uni-bielefeld.de> wrote:
>>>>>> I tried SynonymGraphFilter with my setup and it works right away.
>>>>>> It paid off that I did some modifications on my filters while
>>>>>> testing 6.3 with my setup.
>>>>>
>>>>> Good!
>>>>>
>>>>>> I only replaced SynonymFilter with SynonymGraphFilter and did not
>>>>>> use FlattenGraphFilter, pretty simple. So I can confirm that, up
>>>>>> to this point, SynonymGraphFilter is a full replacement for
>>>>>> SynonymFilter. At least for search-time synonym handling.
>>>>>>
>>>>>> But this also means there is still some work to do on the attributes, right?
>>>>>> Position looks good, type and start are no problem anyway, but
>>>>>> the end position and the positionLength are still wrong for
>>>>>> multi-word synonyms.
>>>>>
>>>>> Can you give an example or make a small test case?
>>>>> PositionLengthAttribute is supposed to be correct coming out of
>>>>> SynonymGraphFilter.
>>>>>
>>>>>> One thing I noticed was that the originating token which "produces"
>>>>>> synonyms comes out last from SynonymGraphFilter, after the
>>>>>> "produced" synonyms.
>>>>>> I will have a look inside with debugger but I guess this is due
>>>>>> to output buffering of SynonymGraphFilter?
>>>>>
>>>>> Yeah they do come out in a different order, which token filters are
>>>>> allowed to do in general for all tokens leaving from the same position
>>>>> ...
>>>>>
>>>>> Mike McCandless
>>>>>
>>>>> http://blog.mikemccandless.com
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>>
>>>>
>>>> --
>>>> *************************************************************
>>>> Bernd Fehling                    Bielefeld University Library
>>>> Dipl.-Inform. (FH)                LibTec - Library Technology
>>>> Universitätsstr. 25                  and Knowledge Management
>>>> 33615 Bielefeld
>>>> Tel. +49 521 106-4060       bernd.fehling(at)uni-bielefeld.de
>>>>
>>>> BASE - Bielefeld Academic Search Engine - www.base-search.net
>>>> *************************************************************
>>>>
>>>
>>
