lucene-dev mailing list archives

From "Uwe Schindler" <...@thetaphi.de>
Subject RE: Is "solr.AnalyzerName" expansion supposed to work for Analyzers?
Date Sat, 10 Sep 2016 15:03:54 GMT
To add,

the managed schema really makes it easy to "rewrite". My plan would be:

- Add a new "type" or "name" attribute to schema.xml, as an alternative to the "class"
attribute
- When a managed schema is loaded, classes are still resolved using the hack, as they are
now. Warnings are printed as said before.
- The managed schema is then changed to switch to the new attribute (there is a getter to
get the symbolic name from the factory, so rewriting is easy)
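
To make the rewrite step concrete, here is a minimal, standalone sketch. Everything in
it is an assumption for illustration: the class LegacyNameRewriter, its method names, and
the simplified regular expression are hypothetical and are not Solr's actual code. A real
implementation would ask the loaded factory for its symbolic name via the getter mentioned
above; the regex is only a stand-in so the example runs without Lucene on the classpath.

```java
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LegacyNameRewriter {
    // Hypothetical, simplified pattern (NOT Solr's real regular expression):
    // "solr." + base name + component kind + "Factory".
    private static final Pattern LEGACY = Pattern.compile(
        "solr\\.([A-Za-z_$][A-Za-z0-9_$]*?)(TokenFilter|Filter|Tokenizer|CharFilter)Factory");

    // Returns the lowercased symbolic name for a legacy "solr.XxxFactory"
    // value, or empty if the value does not look like a legacy short name.
    static Optional<String> symbolicName(String classAttr) {
        Matcher m = LEGACY.matcher(classAttr);
        if (!m.matches()) {
            return Optional.empty();
        }
        return Optional.of(m.group(1).toLowerCase(Locale.ROOT));
    }

    // Produces the attribute text to write back into the managed schema:
    // name="..." for legacy short names, the unchanged class="..." otherwise.
    static String rewriteAttribute(String classAttr) {
        return symbolicName(classAttr)
            .map(name -> "name=\"" + name + "\"")
            .orElse("class=\"" + classAttr + "\"");
    }
}
```

With this sketch, rewriteAttribute("solr.LowerCaseTokenizerFactory") yields
name="lowercase", while an Analyzer reference such as
org.apache.lucene.analysis.core.SimpleAnalyzer is left as a class attribute.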

In addition, this simplifies usage: a GUI could show a dropdown list for assembling
the analyzer. We just need to add a schema REST endpoint that returns all names.

Maybe open an issue targeted for 6.x / 7.0. I'd be happy to help fix this, although I could
only do the SolrResourceLoader and SolrAnalyzer stuff.

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de

> -----Original Message-----
> From: Uwe Schindler [mailto:uwe@thetaphi.de]
> Sent: Saturday, September 10, 2016 4:03 PM
> To: dev@lucene.apache.org; Alexandre Rafalovitch <arafalov@gmail.com>
> Subject: Re: Is "solr.AnalyzerName" expansion supposed to work for
> Analyzers?
> 
> Hi,
> 
> The registry is there. To get all symbolic names of analyzer components on
> the classpath, use the static XxxFactory.availableXxx() methods (for example,
> TokenizerFactory.availableTokenizers()).
> 
> I don't think it makes sense to replace all factories in Solr with named SPIs.
> But I'd suggest adding the type or name attribute to analysis components and
> promoting it. The class attribute can still be used as it is now, but it logs a
> warning if it was misused to load an SPI. If it refers to a real class, all is fine.
> 
> Uwe
> 
> On 10 September 2016 15:56:51 CEST, Alexandre Rafalovitch
> <arafalov@gmail.com> wrote:
> >Wow Uwe,
> >
> >Thanks for the treatise. That's an interesting discussion, but I
> >wonder if anything changed since?
> >
> >In terms of user-confusion/migration, we now have managed schema and
> >can probably rewrite from 'solr.x' to symbolic names on first use. That,
> >of course, requires some sort of registry of those names, and I am
> >not sure one exists (apart from my own solr-start.com hacks). But
> >then the registry may well align with some other configuration
> >reporting by the components. And with plugins/library jars.
> >
> >I am also wondering whether the objection is still valid that other
> >components in Solr (such as search components) are still not able to
> >move to SPI. I am especially curious whether any of that was affected by
> >Noble's work on having libraries loaded into Solr's special
> >collection. What is the mechanism used there to load things?
> >
> >But yes, I can see it is a big topic. I may just update the
> >documentation and examples to mention that Analyzers have to use
> >the full class name when I get to it.
> >
> >Regards,
> >   Alex.
> >----
> >Newsletter and resources for Solr beginners and intermediates:
> >http://www.solr-start.com/
> >
> >
> >On 10 September 2016 at 14:24, Uwe Schindler <uwe@thetaphi.de> wrote:
> >> Hello Alexandre,
> >>
> >>> I can't see a reason why it should be different, but:
> >>>
> >>> This works
> >>>     <fieldType name="text_basic" class="solr.TextField">
> >>>         <analyzer>
> >>>             <tokenizer class="solr.LowerCaseTokenizerFactory" />
> >>>         </analyzer>
> >>>    </fieldType>
> >>>
> >>> This does not:
> >>>     <fieldType name="text_basic" class="solr.TextField">
> >>>         <analyzer class="solr.SimpleAnalyzer"/>
> >>>     </fieldType>
> >>>
> >>> This does work again:
> >>>     <fieldType name="text_basic" class="solr.TextField">
> >>>         <analyzer class="org.apache.lucene.analysis.core.SimpleAnalyzer"/>
> >>>     </fieldType>
> >>>
> >>> Both LowerCaseTokenizerFactory and SimpleAnalyzer are in the same
> >>> package.
> >>>
> >>> Is this a bug or some sort of legacy decision?
> >>
> >> There is a long history behind that and there is also a *fundamental*
> >> difference between the factories used for building custom analyzers in
> >> XML code and just referring to an Analyzer!
> >>
> >> Let me start with some history: From the very beginning there was
> >> the concept of factories in Solr, so implementation classes are
> >> initialized from a map of properties given in the XML. Those factories
> >> were specified by Java binary class name
> >> ("org.apache.solr.foo.bar.MyFactory"). This is used in many places in
> >> Solr. The problem is that those class names could be quite long, so the
> >> SolrResourceLoader has a "hack" to allow short names (which was, IMHO, a
> >> horrible decision). When it sees a class name starting with "solr.", it
> >> tries to look up different possibilities. See code here:
> >> https://goo.gl/P24ZU3 (subpackages is generally a list like
> >> "o.a.solr.something",...).
> >>
> >> In the early days (before Lucene/Solr 4.0), those factories were
> >> *all* part of Solr, so the lookup with the "solr." short name prefix
> >> was easy and the subpackages list was short. So it "just worked" and
> >> many people had those class names in their config files.
> >>
> >> The Analyzers (2nd example) were always referred to by their full
> >> name, because they were part of Lucene and not Solr. Using a "solr."
> >> short name was never ever possible because of that.
> >>
> >> Now a change in 4.0 comes into play: To make the concept of
> >> building "custom" analyzers easier to use for non-Solr users, and to
> >> make the whole concept easier to maintain, the factories for
> >> tokenstream components were moved out of Solr into Lucene
> >> (https://issues.apache.org/jira/browse/LUCENE-2510). The analysis parts
> >> got new package names below the Lucene namespace. The effect of this
> >> would have been that everyone had to change their config files,
> >> because the "solr." shortcut won't work with Lucene classes.
> >>
> >> Now you might ask why the "solr." prefix still works? The reason is a
> >> second fundamental change with Lucene 4. We no longer use class names
> >> in Lucene to refer to stuff like Codecs, PostingsFormats - we use the
> >> Java concept of SPI. All components get a name, and the implementation
> >> class is not exposed to the outside. Like with Codecs, where you use
> >> Codec.forName("Lucene70") to instantiate it, the same was done for
> >> TokenStream components. This now allows creating
> >> StandardTokenizerFactory using the following code:
> >> TokenizerFactory.forName("standard"). Or LowerCaseFilter with
> >> TokenFilterFactory.forName("lowercase"). There is no such concept for
> >> Analyzers (no SPI) [this explains your original question].
> >>
> >> Now we have the two pieces to put together: Refactoring of class
> >> names and adding the SPI concept. The "correct" fix in Solr would have
> >> been to remove the "class=" attribute in the fieldType and replace it with
> >> something called "name" or "type", so the XML would look like
> >> (https://goo.gl/Dr3gpO):
> >>
> >> <fieldType name="something" class="solr.TextField">
> >>    <analyzer>
> >>       <tokenizer name="whitespace" />
> >>    </analyzer>
> >> </fieldType>
> >>
> >> This is similar to the examples for CustomAnalyzer, the corresponding
> >> Lucene class that builds Analyzers from those SPI names:
> >> https://lucene.apache.org/core/6_2_0/analyzers-common/org/apache/lucene/analysis/custom/CustomAnalyzer.html
> >>
> >> The above syntax is wonderful, but again this caused lots of
> >> complaints from Solr developers that people are unable to understand
> >> this WTF :-) It may also have to do with those short names looking more
> >> like <add competitor's name here> analysis component names.... (no
> >> idea, although it's completely unrelated). The issue with more history
> >> is here: https://issues.apache.org/jira/browse/LUCENE-4044
> >>
> >> Because of that, a second hack was added so all schema.xml files
> >> worked like before (in LUCENE-4044). This hack is the only way to
> >> configure tokenstream components up to this day - which is a disaster,
> >> IMHO! The hack is a fancy regular expression that tries to convert the
> >> old "solr.FoobarTokenFilterFactory" to the nicely readable "names" like
> >> above: https://goo.gl/mtWmjm
> >> The factory is then loaded using SPI: https://goo.gl/EwDtQr
> >> IMHO, the hack should be deprecated and removed and the new syntax,
> >> as described above, should be introduced.
> >>
> >> Analyzer class names would still be *full* class names (and will
> >> certainly stay that way, as they are seldom used in Solr). There is
> >> no way to change that!
> >>
> >> Now you have a bit of history and you might see that there is
> >> absolutely no relationship between the class name / package name and
> >> the configured "class" in schema.xml. In fact, the thing above cannot
> >> be fixed. Instead, the issue mentioned before should finally be fixed:
> >> the "class" attribute in token stream components should be deprecated and
> >> removed, and the above "name" (or maybe "type") syntax should be used.
> >>
> >> Uwe
> >>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: dev-help@lucene.apache.org
> >>
> >
> 
> --
> Uwe Schindler
> H.-H.-Meier-Allee 63, 28213 Bremen
> http://www.thetaphi.de
> 

