lucene-solr-dev mailing list archives

From Robert Muir <rcm...@gmail.com>
Subject Re: protwords.txt support in stemmers
Date Tue, 30 Mar 2010 14:07:28 GMT
On Tue, Mar 30, 2010 at 8:33 AM, Yonik Seeley <yonik@lucidimagination.com> wrote:

> On Tue, Mar 30, 2010 at 8:06 AM, Robert Muir <rcmuir@gmail.com> wrote:
> > We have two choices:
> > * we could treat this stuff as impl details, and add protwords.txt support
> > to all stemming factories. we could just wrap the filter with a
> > keywordmarkerfilter internally.
> > * we could deprecate the explicit protwords.txt in the few factories that
> > support it, and instead create a factory for KeywordMarkerFilter.
> > * we could do something else, e.g. both.
> >
> > So, to illustrate, by adding a factory for the KeywordMarkerFilter, a user
> > could do:
> >
> > <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
> > <filter class="solr.SomeStemmer"/>
> >
> > and get the same effect, instead of having to add support for protwords.txt
> > to every single stem factory.
>
> Yep, this decomposition seems more powerful.
>
> Sort of related: for a long time I've had the idea of allowing the
> expression of more complex filter chains that can conditionally
> execute some parts based on tags set by other parts.
>
> This is straightforward to just hand-code in Java of course, but
> trickier to do well in a declarative setting:
>
>  <filter class="solr.Tagger" tag="protect" words="protwords.txt"/>
>  <filter class="solr.SomeStemmer" skipTags="protect"/>
>
> The idea was to also make this fast by allocating a bit per tag
> (assuming we somehow knew all of the possible ones in a particular
> filter chain) and using a bitfield (long) to set and test.  I was
> planning on using Token.flags before the new analysis attribute stuff
> came into being.
>
> It would also be nice to make the token categories generated by
> tokenizers into tags (like StandardTokenizer's ACRONYM, etc).  A
> tokenizer that detected many of the properties could significantly
> speed up analysis because tokens would not have to be re-analyzed to
> see if they contain mixed case, numbers, hyphens, etc (i.e. the fast
> path for WDF would be checking a bit per token).
>
> Anyway, probably something for another day, but I wanted to throw it out
> there.
>
> -Yonik
> http://www.lucidimagination.com
>
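
The bit-per-tag idea quoted above can be sketched roughly as follows. This is a minimal illustration only, not Lucene or Solr API: TagRegistry, TaggedToken and TagChainDemo are hypothetical names, the "stemmer" is a toy that strips a trailing "s", and it assumes all tags in a chain can be registered up front so that each one gets its own bit in a long.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: none of these classes exist in Lucene or Solr.
final class TagRegistry {
    private final Map<String, Integer> bitByTag = new HashMap<>();

    // Returns the bit mask for a tag, assigning the next free bit on first use.
    // A long holds at most 64 distinct tags per filter chain.
    long maskFor(String tag) {
        Integer bit = bitByTag.get(tag);
        if (bit == null) {
            if (bitByTag.size() >= Long.SIZE) {
                throw new IllegalStateException("more than 64 tags in one chain");
            }
            bit = bitByTag.size();
            bitByTag.put(tag, bit);
        }
        return 1L << bit;
    }
}

final class TaggedToken {
    String term;
    long tags; // one bit per tag

    TaggedToken(String term) {
        this.term = term;
    }
}

final class TagChainDemo {
    // Stage 1: set the tag bit on every token whose term is in the word list.
    static void tag(List<TaggedToken> tokens, Set<String> words, long mask) {
        for (TaggedToken t : tokens) {
            if (words.contains(t.term)) {
                t.tags |= mask;
            }
        }
    }

    // Stage 2: toy "stemmer" (strips a trailing 's') that skips any token
    // carrying a bit from skipMask. The check is a single AND per token.
    static void stemUnlessTagged(List<TaggedToken> tokens, long skipMask) {
        for (TaggedToken t : tokens) {
            if ((t.tags & skipMask) != 0) {
                continue;
            }
            if (t.term.length() > 1 && t.term.endsWith("s")) {
                t.term = t.term.substring(0, t.term.length() - 1);
            }
        }
    }

    // Runs the two-stage chain: tag protected words, then stem the rest.
    static List<String> run(List<String> terms, Set<String> protectedWords) {
        TagRegistry registry = new TagRegistry();
        long protect = registry.maskFor("protect");
        List<TaggedToken> tokens = new ArrayList<>();
        for (String s : terms) {
            tokens.add(new TaggedToken(s));
        }
        tag(tokens, protectedWords, protect);
        stemUnlessTagged(tokens, protect);
        List<String> out = new ArrayList<>();
        for (TaggedToken t : tokens) {
            out.add(t.term);
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> protwords = new HashSet<>(Arrays.asList("lucas"));
        // prints [cat, lucas, dog]
        System.out.println(run(Arrays.asList("cats", "lucas", "dogs"), protwords));
    }
}
```

A tokenizer could set further bits (for example a hypothetical "mixedCase" or "hasDigits" tag) through the same registry, so that later stages test a bit instead of re-scanning the term.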

Sorta unrelated too, but on the same topic of performance, I'd really like
to improve the indexing speed with the example schema, and that's my hidden
motivation here.

I think we've already significantly improved WDF and SnowballPorter
performance in trunk, but if we add this support we could at least consider
switching to the much much faster PorterStemmer in the Lucene core for the
example schema, as it would then support protected words via this mechanism.
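
To make the decomposition concrete, here is a minimal sketch of how a single protection step can sit in front of any stemmer, so the stemmer itself needs no protwords.txt handling. It does not use the Lucene API: ProtectedStemming, protect and toyStem are hypothetical names, and toyStem (which strips a trailing "ing") merely stands in for a real stemmer such as PorterStemmer.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.function.UnaryOperator;

// Hypothetical sketch; these names are not Lucene or Solr classes.
final class ProtectedStemming {
    // Wraps any stemmer so that protected words pass through unchanged.
    static UnaryOperator<String> protect(Set<String> protectedWords,
                                         UnaryOperator<String> stemmer) {
        return term -> protectedWords.contains(term) ? term : stemmer.apply(term);
    }

    // Toy stand-in for a real stemmer: strips a trailing "ing" from longer words.
    static String toyStem(String term) {
        return term.endsWith("ing") && term.length() > 5
            ? term.substring(0, term.length() - 3)
            : term;
    }

    public static void main(String[] args) {
        Set<String> protwords = new HashSet<>(Arrays.asList("string"));
        UnaryOperator<String> stem = protect(protwords, ProtectedStemming::toyStem);
        System.out.println(stem.apply("indexing")); // prints index
        System.out.println(stem.apply("string"));   // prints string (protected)
    }
}
```

Because the protection lives outside the stemmer, swapping in a different stemmer only changes the second argument to protect.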

Do you have a preferred way to benchmark the "text" field type, for example?
Ideally in the future the lucene benchmark package could support benchmarking
Solr schema definitions... but we aren't there yet!

-- 
Robert Muir
rcmuir@gmail.com
