lucene-dev mailing list archives

From <>
Subject RE: solr and analyzers module
Date Wed, 19 May 2010 14:38:17 GMT
Nobody in their right mind can disagree with (1).  I should also point out that writing a custom
analyzer is a very typical activity (as is writing a custom scorer), so this should be made as
straightforward as possible.


-----Original Message-----
From: ext Robert Muir [] 
Sent: Wednesday, May 19, 2010 9:51 AM
Subject: solr and analyzers module


I am doing some work to shuffle things around and consolidate
analyzers into what will hopefully be its own versioned module (such
that you could use an older version with a newer Lucene core and we
could remove "fake" Version and use real jar file versions).

For a while I have been thinking about how we might apply this to
Solr, so it gets the same benefit. There are also other "problems"
with analysis in Solr that I would like to fix at the same time:

1. Solr, like Lucene, should be able to work with an older analyzers
module for backwards compatibility purposes.
2. Solr users should optionally be able to use analyzers that are not
in common (smartcn, stempel, icu, ...) easily. Currently this is a
tradeoff against the size of the Solr war file (so they are not
included). At the same time it seems silly to make Solr contribs for
'more analyzers'.

The current idea I have is that Solr would not include
analyzers-common.jar bundled into its war file at all. Instead, all
analyzers modules would also serve as plugins to Solr (you stick them
in solrhome/lib).  By default, Solr would just include
analyzers-common this way, instead of in the war file itself.

So with this idea, analyzers are just a Solr plugin, and the default
Solr install includes the ones it does today, so most users would not
see the difference. But if a user wants Polish, Smart Chinese, or
improved Unicode support, they would be able to drop in one of the
additional analyzer modules easily.
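
To make the "drop a jar into solrhome/lib" idea concrete, here is a
minimal sketch of how a host application could pick up every jar in a
lib directory as a plugin. This is not Solr's actual resource loader:
the class name LibDirLoader and the methods jarUrls and pluginLoader
are hypothetical, and the default path is only the directory named in
the proposal.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not Solr's real plugin loading code.
public class LibDirLoader {

    // Collect a URL for every *.jar file directly inside libDir.
    // A missing directory simply yields an empty list.
    static List<URL> jarUrls(Path libDir) throws IOException {
        List<URL> urls = new ArrayList<>();
        if (!Files.isDirectory(libDir)) {
            return urls;
        }
        try (DirectoryStream<Path> jars = Files.newDirectoryStream(libDir, "*.jar")) {
            for (Path jar : jars) {
                urls.add(jar.toUri().toURL());
            }
        }
        return urls;
    }

    // Build a class loader that can see the plugin jars in addition to
    // whatever the parent (for example the webapp) already provides.
    static URLClassLoader pluginLoader(Path libDir, ClassLoader parent) throws IOException {
        return new URLClassLoader(jarUrls(libDir).toArray(new URL[0]), parent);
    }

    public static void main(String[] args) throws Exception {
        // "solrhome/lib" is assumed here only because the proposal names it.
        Path lib = Path.of(args.length > 0 ? args[0] : "solrhome/lib");
        try (URLClassLoader loader = pluginLoader(lib, LibDirLoader.class.getClassLoader())) {
            for (URL u : loader.getURLs()) {
                System.out.println("plugin jar: " + u);
            }
        }
    }
}
```

With something along these lines, adding Polish or Smart Chinese
support would amount to copying one more jar into the directory; no
change to the war file is needed.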

The factories for Solr serve as a buffer to hide the implementation
details, and I think they should be part of these analyzer modules, so
when you produce an analyzers artifact it is both a plugin to Lucene
and also a plugin to Solr. In my opinion, this factory interface is
very well defined and achieves for Solr <-> analyzers what we want to
achieve for Lucene <-> analyzers, a minimal interface.
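
As an illustration of how small such a factory interface can be, here
is a simplified sketch. It is not the actual Solr factory API: the
types TokenFilter and TokenFilterFactory below are hypothetical
stand-ins that operate on plain strings rather than token streams, and
LowerCaseFactory and TruncateFactory are made-up examples. The point
is only that the host sees string arguments going in and a filter
coming out, and nothing of the implementation behind it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical sketch of a minimal factory interface; not the real Solr API.
public class FactorySketch {

    // Stand-in for a token filter: maps one token to another.
    interface TokenFilter extends UnaryOperator<String> {}

    // The whole contract between host and analyzer module:
    // configure from string arguments, then create filters.
    interface TokenFilterFactory {
        void init(Map<String, String> args);
        TokenFilter create();
    }

    static class LowerCaseFactory implements TokenFilterFactory {
        public void init(Map<String, String> args) {
            // No configuration needed for this filter.
        }
        public TokenFilter create() {
            return t -> t.toLowerCase(Locale.ROOT);
        }
    }

    static class TruncateFactory implements TokenFilterFactory {
        private int length = 5; // default when no "length" argument is given

        public void init(Map<String, String> args) {
            length = Integer.parseInt(args.getOrDefault("length", "5"));
        }
        public TokenFilter create() {
            final int n = length;
            return t -> t.length() > n ? t.substring(0, n) : t;
        }
    }

    // The host splits on whitespace and runs each token through the chain,
    // knowing nothing about the filters beyond the factory interface.
    static List<String> analyze(String text, List<TokenFilterFactory> chain) {
        List<TokenFilter> filters = new ArrayList<>();
        for (TokenFilterFactory factory : chain) {
            filters.add(factory.create());
        }
        List<String> out = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            String t = token;
            for (TokenFilter filter : filters) {
                t = filter.apply(t);
            }
            out.add(t);
        }
        return out;
    }

    public static void main(String[] args) {
        LowerCaseFactory lower = new LowerCaseFactory();
        lower.init(Map.of());
        TruncateFactory truncate = new TruncateFactory();
        truncate.init(Map.of("length", "4"));
        // prints [poli, stem, exam]
        System.out.println(analyze("Polish Stemming Example", List.of(lower, truncate)));
    }
}
```

Because the host depends only on the factory interface, an analyzers
artifact that ships its own factories can be swapped for an older or
newer version without the host changing.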

Down the road, we could look at improving on this further, for example
any given release of analyzers artifacts could include additional
artifacts that "go with it":
1. example configuration files like stopwords lists for different languages
2. example schema definitions (even snippets) for Solr users as a
documentation artifact, so they know how to use this stuff.

Thoughts, alternative proposals?

Robert Muir
