lucene-dev mailing list archives

From Alexandre Rafalovitch <arafa...@gmail.com>
Subject Re: Is "solr.AnalyzerName" expansion supposed to work for Analyzers?
Date Sat, 10 Sep 2016 13:56:51 GMT
Wow Uwe,

Thanks for the treatise. That's an interesting discussion, but I
wonder if anything has changed since?

In terms of user confusion/migration, we now have managed schema and
can probably rewrite from 'solr.x' to symbol names on first use. That,
of course, requires some sort of registry of those names, and I am
not sure one exists (apart from my own solr-start.com hacks). But
then the registry may well align with other configuration
reporting by the components, and with plugin/library jars.

I am also wondering whether the objection still holds that other
components in Solr (such as search components) are not able to
move to SPI. I am especially curious whether any of that was affected by
Noble's work on having libraries loaded into Solr's special
collection. What is the mechanism used there to load things?

But yes, I can see it is a big topic. I may just update the
documentation and examples to mention that Analyzers have to use
the full class name when I get to it.

Regards,
   Alex.
----
Newsletter and resources for Solr beginners and intermediates:
http://www.solr-start.com/


On 10 September 2016 at 14:24, Uwe Schindler <uwe@thetaphi.de> wrote:
> Hello Alexandre,
>
>> I can't see a reason why it should be different, but:
>>
>> This works
>>     <fieldType name="text_basic" class="solr.TextField">
>>         <analyzer>
>>             <tokenizer class="solr.LowerCaseTokenizerFactory" />
>>         </analyzer>
>>    </fieldType>
>>
>> This does not:
>>     <fieldType name="text_basic" class="solr.TextField">
>>         <analyzer class="solr.SimpleAnalyzer"/>
>>     </fieldType>
>>
>> This does work again:
>>     <fieldType name="text_basic" class="solr.TextField">
>>         <analyzer class="org.apache.lucene.analysis.core.SimpleAnalyzer"/>
>>     </fieldType>
>>
>> Both LowerCaseTokenizerFactory and SimpleAnalyzer are in the same
>> package.
>>
>> Is this a bug or some sort of legacy decision?
>
> There is a long history behind that and there is also a *fundamental* difference between
> the factories used for building custom analyzers in XML code and just referring to an Analyzer!
>
> Let me start with some history: From the very beginning there was the concept of factories
> in Solr, where implementation classes are initialized from a map of properties given in the XML.
> Those factories were specified by Java binary class name ("org.apache.solr.foo.bar.MyFactory").
> This is used in many places in Solr. The problem is that those class names could be quite
> long, so the SolrResourceLoader has a "hack" to allow short names (which, IMHO, was a horrible
> decision). When it sees a class name starting with "solr.", it tries to look up different possibilities.
> See code here: https://goo.gl/P24ZU3 (subpackages is generally a list like "o.a.solr.something",...).
>
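To make the short-name lookup described above concrete, here is a minimal sketch of how a "solr." prefix could be expanded into candidate class names. It is only an illustration: the class name ShortNameResolver, the BASE prefix and the SUBPACKAGES list are assumptions made for this example and are not copied from SolrResourceLoader.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of "solr." short-name expansion; not the real SolrResourceLoader code. */
final class ShortNameResolver {
    // Assumed values for illustration only; the real subpackage list lives in Solr.
    private static final String BASE = "org.apache.solr.";
    private static final String[] SUBPACKAGES = {"", "analysis.", "schema.", "handler."};

    /** Returns the fully qualified names to try, in order, for a configured class name. */
    static List<String> candidates(String configured) {
        List<String> out = new ArrayList<>();
        if (!configured.startsWith("solr.")) {
            out.add(configured); // already a full class name: use it unchanged
            return out;
        }
        String simpleName = configured.substring("solr.".length());
        for (String sub : SUBPACKAGES) {
            out.add(BASE + sub + simpleName);
        }
        return out;
    }

    /** Tries each candidate with Class.forName and returns the first one that loads. */
    static Class<?> resolve(String configured) throws ClassNotFoundException {
        for (String name : candidates(configured)) {
            try {
                return Class.forName(name);
            } catch (ClassNotFoundException e) {
                // not in this subpackage, fall through to the next candidate
            }
        }
        throw new ClassNotFoundException(configured);
    }
}
```

With a list like this, candidates("solr.SimpleAnalyzer") only ever yields names under org.apache.solr, so a class living in org.apache.lucene.analysis.core can never be found through the short name, while the full class name is passed through unchanged.
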
> In the early days (before Lucene/Solr 4.0), those factories were *all* part of Solr,
> so the lookup with the "solr." short name prefix was easy and the subpackages list was short.
> So it "just worked" and many people had those class names in their config files.
>
> The Analyzers (2nd example) were always referred to by their full name, because they
> were part of Lucene and not Solr. Using a "solr." short name was never possible because
> of that.
>
> Now a change in 4.0 comes into play: To make the concept of building "custom" analyzers
> easier to use for non-Solr users, and to make the whole concept easier to maintain, the factories
> for tokenstream components were moved out of Solr into Lucene (https://issues.apache.org/jira/browse/LUCENE-2510).
> The analysis parts got new package names below the Lucene namespace. The effect of this would
> have been that everyone had to change their config files, because the "solr." shortcut
> won't work with Lucene classes.
>
> Now you might ask why the "solr." prefix still works? The reason is a second fundamental
> change with Lucene 4. We no longer use class names in Lucene to refer to stuff like Codecs,
> PostingFormats - we use the Java concept of SPI. All components get a name, and the implementation
> class is not exposed to the outside. Like with Codecs, where you use Codec.forName("Lucene70")
> to instantiate it, the same was done for TokenStream components. This now allows you to create
> StandardTokenizerFactory using the following code: TokenizerFactory.forName("standard"). Or
> LowerCaseFilterFactory with TokenFilterFactory.forName("lowercase"). There is no such concept for
> Analyzers (no SPI) [this explains your original question].
>
> Now we have the two pieces to put together: refactoring of class names and the addition of the
> SPI concept. The "correct" fix in Solr would have been to remove the "class=" attribute from
> the analysis components in the fieldType and replace it with something called "name" or "type", so the XML would look like
> (https://goo.gl/Dr3gpO):
>
> <fieldType name="something" class="solr.TextField">
>    <analyzer>
>       <tokenizer name="whitespace" />
>    </analyzer>
> </fieldType>
>
> This is similar to the examples for CustomAnalyzer, the corresponding Lucene class that builds Analyzers from those SPI
> names: https://lucene.apache.org/core/6_2_0/analyzers-common/org/apache/lucene/analysis/custom/CustomAnalyzer.html
>
> The above syntax is wonderful, but again this caused lots of complaints from Solr developers
> that people would be unable to understand this WTF :-) It may also have to do with the fact that those short
> names look more like <add competitor's name here> analysis component names.... (no idea,
> although it's completely unrelated). The issue with more history is here: https://issues.apache.org/jira/browse/LUCENE-4044
>
> Because of that, a second hack was added so that all schema.xml files worked like before
> (in LUCENE-4044). This hack is the only way to configure tokenstream components to this
> day - which is a disaster, IMHO! The hack is a fancy regular expression that tries to convert
> the old "solr.FoobarTokenFilterFactory" to the nicely readable "names" like above: https://goo.gl/mtWmjm
> The factory is then loaded using SPI: https://goo.gl/EwDtQr
> IMHO, the hack should be deprecated and removed and the new syntax, as described above,
> should be introduced.
>
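As a rough illustration of the kind of conversion described above, the sketch below derives an SPI-style name from a legacy "solr.…Factory" class string. The class name LegacyNameConverter and the exact pattern are assumptions made for this example; the real expression is behind the link in the quoted text and may differ.

```java
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: derive an SPI-style name from a legacy "solr." factory class string. */
final class LegacyNameConverter {
    // Assumed pattern for illustration; it is not copied from the Solr source.
    private static final Pattern LEGACY = Pattern.compile(
        "^solr\\.(\\w+?)(?:Tokenizer|TokenFilter|Filter|CharFilter)Factory$");

    /** Returns the lower-cased component name, or empty if the string is not a legacy factory name. */
    static Optional<String> spiName(String legacyClass) {
        Matcher m = LEGACY.matcher(legacyClass);
        if (!m.matches()) {
            return Optional.empty();
        }
        return Optional.of(m.group(1).toLowerCase(Locale.ROOT));
    }
}
```

Under this assumed pattern, "solr.LowerCaseTokenizerFactory" maps to a name that could be handed to TokenizerFactory.forName, whereas "solr.SimpleAnalyzer" matches nothing: it has no factory suffix, and there is no SPI registry for Analyzers to look the name up in anyway.
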
> Analyzer class names would still be *full* class names (and will for sure stay like that,
> as they are seldom used in Solr). There is no way to change that!
>
> Now you have a bit of history and you might see that there is absolutely no relationship
> between the class name / package name and the configured "class" in schema.xml. In fact, the
> thing above cannot be fixed. Instead, the issue mentioned before should finally be fixed:
> the "class" attribute in token stream components should be deprecated and removed, and the above "name"
> (or maybe "type") syntax should be used.
>
> Uwe
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
