jackrabbit-dev mailing list archives

From Marcel Reutegger <marcel.reuteg...@gmx.net>
Subject Re: IndexingConfiguration jr 1.4 release, analyzing, searching and synonymprovider
Date Mon, 13 Aug 2007 13:19:58 GMT
Ard Schrijvers wrote:
> and sorry for spamming, but I just want to share my findings/impressions, and
> what I am posting I am willing to implement and port to the JackRabbit trunk
> (so if you bother to read it, and are positive about it, I will implement it
> :-) )

you don't have to feel sorry, your input is very welcome!


> So, one part that bothers me, is multilinguality (with lang specific
> stopwords, stemming, synonyms). Many customers these days want multilingual
> sites, and search them accordingly. And, obviously, lucene has quite some
> code for exactly this : see contrib/analyzers/src/java.
> Obviously, lucene has many more analyzers, and you can easily add your own.
> AFAIU, there is a single configuration place where I can define the overall
> JackRabbit analyzer that is used within one workspace:
> in repository.xml :
> <param name="analyzer"
> value="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
> but, what I want, is a per-property definable analyzer (I would give body_fr
> a French analyzer, body_de a German one, some properties I might want to be
> indexed with keyword analyzers, like zipcodes). The best place for this IMO,
> is the IndexingConfiguration: then, if you do not configure it, nothing
> changes for you.
> So, for example the first index rule at
> http://wiki.apache.org/jackrabbit/IndexingConfiguration would change in:
> <index-rule nodeType="nt:unstructured" boost="2.0">
>   <property analyzer="org.apache.lucene.analysis.de.GermanAnalyzer">text_de</property>
> </index-rule>
> and during loading, we construct a Map of {jr-property,analyzer} (call it
> propertyAnalyzerMap). Then, all we need to add is one jackrabbit global
> analyzer, that looks like:
> class JRAnalyzer extends Analyzer {
>     Analyzer defaultAnalyzer = new StandardAnalyzer();
>
>     public TokenStream tokenStream(String fieldName, Reader reader) {
>         // Use the analyzer configured for this property, if there is one.
>         Analyzer analyzer = (Analyzer) propertyAnalyzerMap.get(fieldName);
>         if (analyzer != null) {
>             return analyzer.tokenStream(fieldName, reader);
>         } else {
>             return this.defaultAnalyzer.tokenStream(fieldName, reader);
>         }
>     }
> }
> This very same JRAnalyzer is also used for the QueryParser in
> LuceneQueryBuilder, so this will work also for searching IIUC. So, WDOT? I
> can implement it and send a patch, but if the community is reluctant to it, I
> will have to do it for myself in a non jr code intrusive way.

This would work quite well for jcr:contains functions that operate on a 
property. However I'm not sure what to do with this:

//*[jcr:contains(., 'hägar')]

the node scope does not indicate which analyzer to use for the query statement. 
Would we just run the statement through all analyzers and combine them in an OR 
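
To make the "run it through all analyzers" idea concrete, here is a minimal sketch. It does not use the Lucene or Jackrabbit APIs: plain functions stand in for analyzers, and the class name, the method names and the "OR" output syntax are all hypothetical. It only shows the shape of the approach, which is to analyze the node-scope text once per configured analyzer, drop duplicate results, and OR the rest together.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

// Hypothetical sketch: plain functions stand in for Lucene analyzers.
class NodeScopeQuerySketch {

    /** Runs the query text through every analyzer and keeps the distinct results. */
    static Set<String> analyzeWithAll(String text,
                                      Map<String, Function<String, String>> analyzers) {
        Set<String> terms = new LinkedHashSet<>();
        for (Function<String, String> analyzer : analyzers.values()) {
            terms.add(analyzer.apply(text));
        }
        return terms;
    }

    /** Joins the distinct terms into one OR expression (illustrative syntax only). */
    static String toOrQuery(Set<String> terms) {
        return String.join(" OR ", terms);
    }

    public static void main(String[] args) {
        Map<String, Function<String, String>> analyzers = new LinkedHashMap<>();
        // Stand-in for a default analyzer: lower-cases only.
        analyzers.put("default", s -> s.toLowerCase(Locale.ROOT));
        // Stand-in for a language-specific analyzer: lower-cases and folds a-umlaut to "a".
        analyzers.put("text_de", s -> s.toLowerCase(Locale.ROOT).replace("\u00e4", "a"));

        // "\u00e4" is a-umlaut, written as an escape to stay encoding-independent.
        System.out.println(toOrQuery(analyzeWithAll("H\u00e4gar", analyzers)));
    }
}
```

When two analyzers produce the same term it appears only once, so the combined query grows only where the analyzers actually disagree.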

> Example of the SynonymProvider mentioned at the top:
> If my suggested changes are accepted, things like a SynonymProvider become
> superfluous, and very easy to add on the fly:
> suppose, I want on the "body" property of my nodes always full searching with
> dutch synonyms. This boils down to adding an analyzer for this property, that
> extends the DutchAnalyzer in lucene, and that adds synonym functionality
> (very simple example in "lucene in action" book). I think it is better to do
> synonyms during analyzing (as opposed to the SynonymProvider in jr trunk),
> and simply use an analyzer for it. Of course, one difference would be that
> with the current SynonymProvider you specifically have to request a
> synonym search (~term), while with an analyzer, you define which
> properties should be indexed with a synonym analyzer, and searched
> accordingly (without having to specify it),

well, those are actually the reasons why I implemented it the other way. If you 
go the analyzer way to expand synonyms you have to re-index the complete content 
if you want to add a single synonym. I also wanted the user to decide if 
synonyms should be considered. Again this would not be possible if the analyzer 
automatically adds synonyms.
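
The two points above (no re-indexing when a synonym is added, and the user opting in with ~term) can be illustrated with a small sketch. This is not the SynonymProvider code from the Jackrabbit trunk. The class and method names are hypothetical, and a plain in-memory map stands in for whatever synonym source is configured.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a mutable synonym table that is consulted only at query time.
class QueryTimeSynonymSketch {
    private final Map<String, List<String>> synonyms = new HashMap<>();

    /** Adding a synonym touches only this table, never the index. */
    void addSynonym(String term, String synonym) {
        synonyms.computeIfAbsent(term, k -> new ArrayList<>()).add(synonym);
    }

    /** Expands a term only when the user opted in with the "~" prefix. */
    List<String> expand(String queryTerm) {
        List<String> result = new ArrayList<>();
        if (!queryTerm.startsWith("~")) {
            // No prefix: search for the literal term only.
            result.add(queryTerm);
            return result;
        }
        String term = queryTerm.substring(1);
        result.add(term);
        result.addAll(synonyms.getOrDefault(term, Collections.<String>emptyList()));
        return result;
    }

    public static void main(String[] args) {
        QueryTimeSynonymSketch sketch = new QueryTimeSynonymSketch();
        sketch.addSynonym("quick", "fast");
        System.out.println(sketch.expand("quick"));   // literal term only
        System.out.println(sketch.expand("~quick"));  // term plus its synonyms
    }
}
```

A synonym added later takes effect on the next query, because the indexed terms are never rewritten.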

but fortunately, with jackrabbit both are possible ;) if one prefers to expand 
terms on index time, just use an appropriate analyzer and don't configure a 

> So WDOT? Again, sry for mailing so much, just trying to sell my ideas :-)

again, your ideas are very welcome.

