lucene-solr-dev mailing list archives

From Yonik Seeley <yo...@lucidimagination.com>
Subject Re: protwords.txt support in stemmers
Date Tue, 30 Mar 2010 12:33:56 GMT
On Tue, Mar 30, 2010 at 8:06 AM, Robert Muir <rcmuir@gmail.com> wrote:
> We have two choices:
> * we could treat this stuff as impl details, and add protwords.txt support
> to all stemming factories. we could just wrap the filter with a
> keywordmarkerfilter internally.
> * we could deprecate the explicit protwords.txt in the few factories that
> support it, and instead create a factory for KeywordMarkerFilter.
> * we could do something else, e.g. both.
>
> So, to illustrate, by adding a factory for the KeywordMarkerFilter, a user
> could do:
>
> <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
> <filter class="solr.SomeStemmer"/>
>
> and get the same effect, instead of having to add support for protwords.txt
> to every single stem factory.

Yep, this decomposition seems more powerful.

Sort of related: for a long time I've had the idea of allowing the
expression of more complex filter chains that can conditionally
execute some parts based on tags set by other parts.

This is straightforward to just hand-code in Java of course, but
trickier to do well in a declarative setting:

 <filter class="solr.Tagger" tag="protect" words="protwords.txt"/>
 <filter class="solr.SomeStemmer" skipTags="protect"/>

The idea was to also make this fast by allocating a bit per tag
(assuming we somehow knew all of the possible ones in a particular
filter chain) and using a bitfield (long) to set and test.  I was
planning on using Token.flags before the new analysis attribute stuff
came into being.
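
A minimal sketch of the bit-per-tag idea, written as plain Java with no Lucene or Solr dependency. Every name here (TagSketch, TagRegistry, Token, tag, stem, analyze) is hypothetical and exists only for illustration, and the "stemmer" is a toy that strips one trailing "s". It assumes all tags in a chain can be registered up front and that there are at most 64 of them, so each tag fits in one bit of a long.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: none of these names are Lucene or Solr APIs.
final class TagSketch {

    /** Assigns each tag name one bit of a long; at most 64 tags per chain. */
    static final class TagRegistry {
        private final Map<String, Long> bits = new HashMap<>();

        long bitFor(String tag) {
            Long bit = bits.get(tag);
            if (bit == null) {
                if (bits.size() >= Long.SIZE) {
                    throw new IllegalStateException("more than 64 tags in one chain");
                }
                bit = 1L << bits.size();
                bits.put(tag, bit);
            }
            return bit;
        }
    }

    /** A token carrying its text and a bitfield of tags. */
    static final class Token {
        String text;
        long tags;

        Token(String text) {
            this.text = text;
        }
    }

    /** Stands in for the declarative Tagger: sets tagBit on tokens found in words. */
    static void tag(List<Token> tokens, Set<String> words, long tagBit) {
        for (Token t : tokens) {
            if (words.contains(t.text)) {
                t.tags |= tagBit;
            }
        }
    }

    /** Stands in for a stemmer with skipTags: one AND per token decides whether to skip. */
    static void stem(List<Token> tokens, long skipMask) {
        for (Token t : tokens) {
            if ((t.tags & skipMask) != 0) {
                continue; // tagged token passes through unchanged
            }
            // Toy stemmer (assumption): strip a single trailing 's'.
            if (t.text.length() > 1 && t.text.endsWith("s")) {
                t.text = t.text.substring(0, t.text.length() - 1);
            }
        }
    }

    /** Runs the two-stage chain: tag protected words, then stem everything else. */
    static List<String> analyze(List<String> input, Set<String> protectedWords) {
        TagRegistry registry = new TagRegistry();
        long protect = registry.bitFor("protect");

        List<Token> tokens = new ArrayList<>();
        for (String s : input) {
            tokens.add(new Token(s));
        }
        tag(tokens, protectedWords, protect);
        stem(tokens, protect);

        List<String> out = new ArrayList<>();
        for (Token t : tokens) {
            out.add(t.text);
        }
        return out;
    }
}
```

Because skipMask can be the OR of several tag bits, a filter that skips on more than one tag would still pay for only a single test per token.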

It would also be nice to make the token categories generated by
tokenizers into tags (like StandardTokenizer's ACRONYM, etc).  A
tokenizer that detected many of the properties could significantly
speed up analysis because tokens would not have to be re-analyzed to
see if they contain mixed case, numbers, hyphens, etc (i.e. the fast
path for WordDelimiterFilter (WDF) would be checking a bit per token).
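
As a rough illustration of that fast path, the sketch below computes character-class bits once per token and lets a downstream filter decide with a bit test whether the token needs any further inspection. The class name, flag constants and both methods are hypothetical, not existing Lucene or Solr APIs, and the splitting rule is a deliberately conservative guess rather than WordDelimiterFilter's actual logic.

```java
// Hypothetical sketch: flag names and rules are assumptions, not Lucene APIs.
final class TokenProps {
    static final long HAS_UPPER = 1L << 0;
    static final long HAS_LOWER = 1L << 1;
    static final long HAS_DIGIT = 1L << 2;
    static final long HAS_DELIM = 1L << 3; // anything that is not a letter or digit

    /** One pass over the token text, as a tokenizer could do while scanning. */
    static long classify(String text) {
        long flags = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isUpperCase(c)) {
                flags |= HAS_UPPER;
            } else if (Character.isLowerCase(c)) {
                flags |= HAS_LOWER;
            } else if (Character.isDigit(c)) {
                flags |= HAS_DIGIT;
            } else if (!Character.isLetter(c)) {
                flags |= HAS_DELIM; // caseless letters set no flag
            }
        }
        return flags;
    }

    /**
     * Conservative check: false means a delimiter-splitting filter can pass the
     * token through untouched; true only means it has to look more closely.
     */
    static boolean mayNeedSplitting(long flags) {
        long letters = HAS_UPPER | HAS_LOWER;
        boolean mixedCase = (flags & letters) == letters;
        boolean lettersAndDigits = (flags & HAS_DIGIT) != 0 && (flags & letters) != 0;
        return mixedCase || lettersAndDigits || (flags & HAS_DELIM) != 0;
    }
}
```

With this rule, "wifi" and "2010" take the fast path, while "Wi-Fi", "PowerShot" and "sd500" are flagged for closer inspection.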

Anyway, probably something for another day, but I wanted to throw it out there.

-Yonik
http://www.lucidimagination.com
