jackrabbit-dev mailing list archives

From "Ard Schrijvers" <a.schrijv...@hippo.nl>
Subject IndexingConfiguration jr 1.4 release, analyzing, searching and synonymprovider
Date Wed, 08 Aug 2007 15:33:29 GMT
Hello, 

Sorry for spamming, but I want to share my findings and impressions. Everything I propose
below I am willing to implement and port to the Jackrabbit trunk (so if you bother to read it,
and are positive about it, I will implement it :-) )

(If you make it to the end of this mail, I also describe how simple it would become to add
the SynonymProvider functionality that was just created in the trunk....)

First of all, the IndexingConfiguration: very promising! Exactly what we need for better indexing
and, consequently, better search results. Because, in the end, what good is a repository when
customers can't find the results they are looking for? Storing, versioning, workflow are all
very important, but no good when nobody can find their content (duhh, obviously).

One part that bothers me is multilinguality (language-specific stopwords, stemming,
synonyms). Many customers these days want multilingual sites, and want to search them accordingly.
Lucene has quite some code for exactly this: see contrib/analyzers/src/java.


Lucene has many more analyzers, and you can easily add your own. AFAIU, there is
a single configuration place where I can define the overall Jackrabbit analyzer that is used
within one workspace:

in repository.xml :

<param name="analyzer" value="org.apache.lucene.analysis.standard.StandardAnalyzer"/>

But what I want is an analyzer that is definable per property (I would give body_fr a French
analyzer, body_de a German one, and some properties, like zipcodes, I might want to index with
a keyword analyzer). The best place for this IMO is the IndexingConfiguration: then, if you do
not configure it, nothing changes for you.
 
So, for example, the first index rule at http://wiki.apache.org/jackrabbit/IndexingConfiguration
would change into:

<index-rule nodeType="nt:unstructured"
              boost="2.0">
    <property analyzer="org.apache.lucene.analysis.de.GermanAnalyzer">text_de</property>
</index-rule>

During loading, we construct a Map of {jr-property,analyzer} (call it propertyAnalyzerMap).
Then, all we need to add is one Jackrabbit global analyzer that looks like:

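import java.io.Reader;
import java.util.Map;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

class JRAnalyzer extends Analyzer {
	private final Analyzer defaultAnalyzer = new StandardAnalyzer();
	// {jr-property, analyzer} map built from the IndexingConfiguration
	private final Map propertyAnalyzerMap;

	JRAnalyzer(Map propertyAnalyzerMap) {
		this.propertyAnalyzerMap = propertyAnalyzerMap;
	}

	public TokenStream tokenStream(String fieldName, Reader reader) {
		// use the property's own analyzer if one is configured, else the default
		Analyzer analyzer = (Analyzer) propertyAnalyzerMap.get(fieldName);
		if (analyzer == null) {
			analyzer = defaultAnalyzer;
		}
		return analyzer.tokenStream(fieldName, reader);
	}
}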
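
To make the loading step a bit more concrete, here is a rough sketch of how the analyzer
attribute could be read from the configuration XML. It only uses the JDK's DOM parser and
collects {property name, analyzer class name} pairs. The class PropertyAnalyzerConfig and its
parse method are names I made up for illustration, not existing Jackrabbit code, and the
analyzer attribute itself is only my proposal.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper (not part of Jackrabbit): reads the proposed
// "analyzer" attribute from <property> elements of an indexing configuration.
final class PropertyAnalyzerConfig {

    /** Returns {property name -> analyzer class name} for properties that declare an analyzer. */
    static Map<String, String> parse(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Cannot parse indexing configuration", e);
        }
        Map<String, String> result = new LinkedHashMap<>();
        NodeList props = doc.getElementsByTagName("property");
        for (int i = 0; i < props.getLength(); i++) {
            Element prop = (Element) props.item(i);
            // getAttribute returns "" when the attribute is absent
            String analyzerClass = prop.getAttribute("analyzer");
            if (!analyzerClass.isEmpty()) {
                result.put(prop.getTextContent().trim(), analyzerClass);
            }
        }
        return result;
    }
}
```

Properties without an analyzer attribute are simply left out of the map, so they would fall
back to the default analyzer. Turning each class name into an Analyzer instance (for example
by reflection) is left out of the sketch.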

This very same JRAnalyzer is also used for the QueryParser in LuceneQueryBuilder, so IIUC this
will also work for searching. So, WDOT? I can implement it and send a patch, but if the
community is reluctant, I will have to do it for myself in a way that does not intrude on the jr code.

Example of the SynonymProvider mentioned at the top:

If my suggested changes are accepted, things like a SynonymProvider become superfluous, and
very easy to add on the fly:

Suppose I always want full-text searching with Dutch synonyms on the "body" property of my nodes.
This boils down to adding an analyzer for this property that extends the DutchAnalyzer in
Lucene and adds synonym functionality (very simple example in the "Lucene in Action" book).
I think it is better to do synonyms during analyzing (as opposed to the SynonymProvider in
jr trunk), and simply use an analyzer for it. Of course, one difference would be that with the
current SynonymProvider you have to explicitly request a synonym search (~term), while with
an analyzer you define which properties should be indexed with a synonym analyzer, and they
are searched accordingly (without having to specify it).
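
To illustrate what "synonyms during analyzing" means, here is a minimal sketch in plain Java,
deliberately without any Lucene classes. Each synonym is emitted right after the original term
with a position increment of 0, so it occupies the same position in the token stream. In
Lucene this logic would sit in a TokenFilter inside the analyzer. The class SynonymExpansion,
its Token type and the Dutch word pair in the usage note are made up for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Illustration only: a plain-Java stand-in for what a synonym-injecting
// token filter would do at analysis time.
final class SynonymExpansion {

    /** A term plus its position increment (0 = same position as the previous token). */
    static final class Token {
        final String term;
        final int positionIncrement;

        Token(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }
    }

    /** Emits every term, followed by its synonyms stacked on the same position. */
    static List<Token> expand(List<String> terms, Map<String, List<String>> synonyms) {
        List<Token> out = new ArrayList<>();
        for (String term : terms) {
            out.add(new Token(term, 1));
            List<String> syns = synonyms.get(term);
            if (syns == null) {
                syns = Collections.emptyList();
            }
            for (String syn : syns) {
                out.add(new Token(syn, 0));
            }
        }
        return out;
    }
}
```

For example, with "rijwiel" registered as a synonym of "fiets", the terms "rode fiets" would
yield three tokens, with "rijwiel" sharing the position of "fiets", so a search for either word
matches the same text without the user asking for a synonym search.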

So WDOT? Again, sry for mailing so much, just trying to sell my ideas :-) 

 
-- 

Hippo
Oosteinde 11
1017WT Amsterdam
The Netherlands
Tel  +31 (0)20 5224466
-------------------------------------------------------------
a.schrijvers@hippo.nl / ard@apache.org / http://www.hippo.nl
-------------------------------------------------------------- 
