lucene-solr-dev mailing list archives

From Robert Muir <rcm...@gmail.com>
Subject protwords.txt support in stemmers
Date Tue, 30 Mar 2010 12:06:27 GMT
Hello Solr devs,

One thing we did recently in Lucene that I would like to expose in Solr is
adding support for "protected words" to all stemmers.

The way this works is that a TokenStream attribute, KeywordAttribute, carries
a boolean flag, and all the stem filters know to skip tokens that have this
flag set.
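
To make the mechanism concrete, here is a minimal plain-Java sketch of the idea. It is a toy model and not the Lucene API: the Token class, its keyword field, and the suffix-stripping "stemmer" are all hypothetical stand-ins, chosen only to show a stem stage skipping flagged tokens.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: this is NOT the Lucene API. Token, its keyword flag and the
// suffix-stripping "stemmer" are hypothetical stand-ins for illustration.
class KeywordFlagSketch {

    // A token carrying its text plus a boolean playing the role of KeywordAttribute.
    static final class Token {
        final String text;
        final boolean keyword;

        Token(String text, boolean keyword) {
            this.text = text;
            this.keyword = keyword;
        }
    }

    // Deliberately naive stemmer: strips a trailing "s" from words longer than 3 chars.
    static String naiveStem(String word) {
        if (word.length() > 3 && word.endsWith("s")) {
            return word.substring(0, word.length() - 1);
        }
        return word;
    }

    // Stem filter stage: tokens flagged as keywords pass through untouched.
    static List<Token> stemFilter(List<Token> in) {
        List<Token> out = new ArrayList<>();
        for (Token t : in) {
            out.add(t.keyword ? t : new Token(naiveStem(t.text), false));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Token> tokens = new ArrayList<>();
        tokens.add(new Token("windows", true));   // flagged: left as-is
        tokens.add(new Token("windows", false));  // not flagged: stemmed
        for (Token t : stemFilter(tokens)) {
            System.out.println(t.text);
        }
    }
}
```

In this sketch the same word comes out as "windows" when flagged and "window" when not, which is the whole point of the flag: the stemmer itself needs no knowledge of any word list.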

We also added two neat tokenfilters that make this easy to use:
* KeywordMarkerFilter: a tokenfilter that, given a set of input words, marks
them as keywords with this attribute so any later stemmer ignores them.
* StemmerOverrideFilter: a tokenfilter that, given a map of input
words->stems, stems them with the dictionary and marks them as keywords so
any later stemmer ignores them.
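
As a rough illustration of how the two filters would sit in front of a stemmer, here is a self-contained plain-Java sketch. It is not the Lucene API: the class, the method names (markKeywords, overrideStems, stem, analyze) and the trailing-"s" stemmer are hypothetical, and having the override stage skip tokens that are already flagged is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model only, NOT the Lucene API: every class and method name is hypothetical.
class ProtectedWordsSketch {

    // Token text plus a boolean playing the role of KeywordAttribute.
    static final class Token {
        final String text;
        final boolean keyword;

        Token(String text, boolean keyword) {
            this.text = text;
            this.keyword = keyword;
        }
    }

    // Role of KeywordMarkerFilter: flag every token found in the protected set.
    static List<Token> markKeywords(List<Token> in, Set<String> protectedWords) {
        List<Token> out = new ArrayList<>();
        for (Token t : in) {
            out.add(new Token(t.text, t.keyword || protectedWords.contains(t.text)));
        }
        return out;
    }

    // Role of StemmerOverrideFilter: swap in the dictionary stem and flag the
    // token. Already-flagged tokens are skipped (an assumption of this sketch).
    static List<Token> overrideStems(List<Token> in, Map<String, String> overrides) {
        List<Token> out = new ArrayList<>();
        for (Token t : in) {
            String stem = t.keyword ? null : overrides.get(t.text);
            out.add(stem == null ? t : new Token(stem, true));
        }
        return out;
    }

    // Naive stemmer stage: strips a trailing "s" unless the token is flagged.
    static List<Token> stem(List<Token> in) {
        List<Token> out = new ArrayList<>();
        for (Token t : in) {
            boolean strip = !t.keyword && t.text.length() > 3 && t.text.endsWith("s");
            out.add(strip ? new Token(t.text.substring(0, t.text.length() - 1), false) : t);
        }
        return out;
    }

    // Chain the stages in order: marker -> override -> stemmer.
    static List<String> analyze(List<String> words, Set<String> protectedWords,
                                Map<String, String> overrides) {
        List<Token> tokens = new ArrayList<>();
        for (String w : words) {
            tokens.add(new Token(w, false));
        }
        tokens = stem(overrideStems(markKeywords(tokens, protectedWords), overrides));
        List<String> result = new ArrayList<>();
        for (Token t : tokens) {
            result.add(t.text);
        }
        return result;
    }

    public static void main(String[] args) {
        Set<String> protectedWords = new HashSet<>(Arrays.asList("news"));
        Map<String, String> overrides = new HashMap<>();
        overrides.put("analyses", "analysis");
        System.out.println(analyze(Arrays.asList("news", "analyses", "cats"),
                protectedWords, overrides));
    }
}
```

Here "news" survives because it is protected, "analyses" becomes "analysis" from the dictionary and is then left alone by the stemmer, and only "cats" is actually stemmed.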

We have a few options:
* We could treat this stuff as impl details, and add protwords.txt support
to all stemming factories. We could just wrap the filter with a
KeywordMarkerFilter internally.
* We could deprecate the explicit protwords.txt in the few factories that
support it, and instead create a factory for KeywordMarkerFilter.
* We could do something else, e.g. both.

So, to illustrate, by adding a factory for the KeywordMarkerFilter, a user
could do:

<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.SomeStemmer"/>

and get the same effect, instead of having to add support for protwords.txt
to every single stem factory.

I don't really have a personal preference as to how we do it, but it would
be cool to have a plan so we can add these factories and clean a few things
up.

In any event, I think we should add a factory for the StemmerOverrideFilter,
so someone can have a text file with exceptions; the Dutch handling for
"fiets" comes to mind.

Thanks

-- 
Robert Muir
rcmuir@gmail.com
