lucene-dev mailing list archives

From Jan Høydahl (Updated) (JIRA) <>
Subject [jira] [Updated] (SOLR-2866) Marked synonym filter for selective token expansion
Date Mon, 31 Oct 2011 16:07:32 GMT


Jan Høydahl updated SOLR-2866:

          Component/s:     (was: SearchComponents - other)
                       Schema and Analysis
    Affects Version/s:     (was: 3.4)
        Fix Version/s:     (was: 3.4)
               Labels: stemming synonyms  (was: patch)
> Marked synonym filter for selective token expansion
> ---------------------------------------------------
>                 Key: SOLR-2866
>                 URL:
>             Project: Solr
>          Issue Type: Improvement
>          Components: Schema and Analysis
>         Environment: Solr 3.4
>            Reporter: Victor van der Wolf
>            Priority: Minor
>              Labels: stemming, synonyms
>             Fix For: 3.5
>         Attachments:,,
> Hi everybody,
> My name is Victor van der Wolf, and I recently started working for the Royal Library in the
Netherlands. One of my first assignments here was to see whether I could implement a stemming
algorithm for our websites. Our search engine is Solr/Lucene 3.4.
> I had two requirements to work with:
> 1) It should be possible to switch the stemming functionality on and off in the
front end.
> 2) No extra storage should be required (no extra indexing).
> I soon came to the conclusion that it would be practical to use the SynonymFilter
for this. I got hold of a Dutch library and ran a stemming algorithm over it to generate a
synonym file.
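One way to picture the synonym-file step is sketched below. It is not Victor's actual tooling: the word list, the toy `stem` function, and the class name `StemSynonymSketch` are all assumptions made for illustration. The sketch groups words that share a stem and writes each group as one comma-separated line, which is the form Solr's synonym files use for equivalent terms.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.function.Function;

// Hypothetical helper: turns a word list plus a stemmer into synonym-file lines.
final class StemSynonymSketch {

    // Groups words by their stem and returns one "a, b, c" line per group.
    // Groups with a single member are skipped because they add no synonyms.
    static List<String> synonymLines(Collection<String> words,
                                     Function<String, String> stem) {
        Map<String, SortedSet<String>> groups = new TreeMap<>();
        for (String word : words) {
            groups.computeIfAbsent(stem.apply(word), k -> new TreeSet<>()).add(word);
        }
        List<String> lines = new ArrayList<>();
        for (SortedSet<String> group : groups.values()) {
            if (group.size() > 1) {
                lines.add(String.join(", ", group));
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        // Toy stemmer (assumption): strips a trailing "en". A real Dutch
        // stemmer would replace this function.
        Function<String, String> toyStem =
            w -> w.endsWith("en") ? w.substring(0, w.length() - 2) : w;
        List<String> words = Arrays.asList("fiets", "fietsen", "boek", "boeken", "huis");
        for (String line : synonymLines(words, toyStem)) {
            System.out.println(line);
        }
    }
}
```

Because the stemming runs offline when the file is built, nothing extra is stored in the index, which matches the second requirement above.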
> Then I thought that I could maybe use 2 different query analyzers under the "field type"
and then call one or the other depending on whether I want stemming or not, like this:
q=<field>:<analyzer>:<search term>. Unfortunately this did not seem possible.
> Then, after some discussions with Erick Erickson, it became clear that a good approach
could be to write my own SynonymFilter and apply some kind of token marking to decide whether a
token should be "synonymized" or not. Well, I did just that and it works like a charm.
> I would like to contribute this MarkedSynonymFilter class to the project.
> I used the SynonymFilter class as a starting point and added some extra functionality
to it. First of all, I added 3 new parameters called lookup, preMark and postmark. The preMark
and postmark parameters contain a prefix and a suffix used to recognize whether a token should
be "synonymized" or not. A simple regex is used to determine this. The lookup parameter then
determines the behaviour of the MarkedSynonymFilter:
> lookup=marked - marked tokens will be synonymized
> lookup=unmarked - unmarked tokens will be synonymized
> lookup=all - all tokens will be synonymized
> lookup=none - no tokens will be synonymized
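The decision logic described above can be sketched roughly as follows. This is not the code from the attached patch: the class name `MarkedLookupSketch`, the enum, the `[` and `]` marks in the demo, and the choice to strip the marks before a synonym lookup are all assumptions made for illustration. It shows only how a regex built from the two marks and the `lookup` mode could decide whether a token is passed to the synonym lookup.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only; not the MarkedSynonymFilter from the patch.
final class MarkedLookupSketch {

    // Mirrors the four values of the lookup parameter.
    enum Lookup { MARKED, UNMARKED, ALL, NONE }

    private final Lookup lookup;
    private final Pattern markPattern;

    MarkedLookupSketch(Lookup lookup, String preMark, String postMark) {
        this.lookup = lookup;
        // Pattern.quote keeps marks such as "[" or "*" literal in the regex.
        this.markPattern = Pattern.compile(
            "^" + Pattern.quote(preMark) + "(.+)" + Pattern.quote(postMark) + "$");
    }

    // True when the token is wrapped in the configured prefix and suffix.
    boolean isMarked(String token) {
        return markPattern.matcher(token).matches();
    }

    // Returns the token without its marks (assumption: marks are removed
    // before the synonym lookup); unmarked tokens are returned unchanged.
    String stripMarks(String token) {
        Matcher m = markPattern.matcher(token);
        return m.matches() ? m.group(1) : token;
    }

    // Decides whether this token should be sent through synonym expansion.
    boolean shouldExpand(String token) {
        switch (lookup) {
            case MARKED:   return isMarked(token);
            case UNMARKED: return !isMarked(token);
            case ALL:      return true;
            default:       return false; // NONE
        }
    }

    public static void main(String[] args) {
        MarkedLookupSketch filter = new MarkedLookupSketch(Lookup.MARKED, "[", "]");
        for (String token : new String[] {"[fietsen]", "boek"}) {
            System.out.println(token + " -> expand: " + filter.shouldExpand(token)
                + ", lookup key: " + filter.stripMarks(token));
        }
    }
}
```

With a scheme like this the front end could switch stemming on per query term by wrapping the term in the marks, while the index itself stays unchanged.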
> I started out writing this based on version 3.3. Later I discovered that we were using
3.4 and I had to upgrade it. Unfortunately the whole SynonymFilter code has been revised, and
for the moment there are a Slow and a Fast synonym filter, of which the Slow one is deprecated.
My addition is based on the slow version, I am afraid.
> Anyway, I am curious about your comments. Please let me know if I should go forward with
this and create a JIRA issue with my code attached as a patch.
> Cheers,
> Victor van der Wolf
