jackrabbit-dev mailing list archives

From "Ard Schrijvers" <a.schrijv...@hippo.nl>
Subject RE: IndexingConfiguration jr 1.4 release, analyzing, searching and synonymprovider
Date Tue, 14 Aug 2007 10:03:21 GMT

> > class JRAnalyzer extends Analyzer {
> >     Analyzer defaultAnalyzer = new StandardAnalyzer();
> >
> >     public TokenStream tokenStream(String fieldName, Reader reader) {
> >         Analyzer analyzer = (Analyzer) propertyAnalyzerMap.get(fieldName);
> >         if (analyzer != null) {
> >             return analyzer.tokenStream(fieldName, reader);
> >         } else {
> >             return this.defaultAnalyzer.tokenStream(fieldName, reader);
> >         }
> >     }
> > }
> > 
> > This very same JRAnalyzer is also used for the QueryParser in
> > LuceneQueryBuilder, so this will work also for searching IIUC. So, WDOT?
> > I can implement it and send a patch, but if the community is reluctant
> > to it, I will have to do it for myself in a non jr code intrusive way.
> 
> This would work quite well for jcr:contains functions that operate on a
> property. However I'm not sure what to do with this:
>
> //*[jcr:contains(., 'hägar')]
>
> the node scope does not indicate which analyzer to use for the query
> statement. Would we just run the statement through all analyzers and
> combine them in an OR query?

Hmm, good point :-) OR-ing the terms from all analyzers seems wrong to me (apart from possibly
being inefficient), because you might get results you should not get. I only know a Dutch
example: suppose you index "branden" (= to burn) with the Dutch analyzer. Because of stemming,
this results in the term "brand". Now OR-ing might return hits in English text that contains
"brand", which you aren't looking for at all. Anyway, you have a good point about this problem,
but since I think multilingual indexing might be quite useful, I'll give it another thought.
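
The false-positive concern above can be illustrated with a toy sketch. It is plain Java, not the Lucene or Jackrabbit API: `dutchAnalyze` and `defaultAnalyze` are hypothetical stand-ins for a stemming Dutch analyzer and a non-stemming default analyzer, and the stemming rule (strip a trailing "en") is a crude assumption made only for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class StemCollisionSketch {

    // Hypothetical stand-in for a stemming Dutch analyzer: lowercases,
    // splits on whitespace, strips a trailing "en" from longer words.
    static List<String> dutchAnalyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase().split("\\s+")) {
            boolean strip = token.length() > 4 && token.endsWith("en");
            terms.add(strip ? token.substring(0, token.length() - 2) : token);
        }
        return terms;
    }

    // Hypothetical stand-in for a default analyzer without stemming.
    static List<String> defaultAnalyze(String text) {
        return Arrays.asList(text.toLowerCase().split("\\s+"));
    }

    static void add(Map<String, Set<Integer>> index, int docId, List<String> terms) {
        for (String term : terms) {
            index.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
    }

    // One shared full-text index, as a node-scope jcr:contains would search.
    static Map<String, Set<Integer>> buildIndex() {
        Map<String, Set<Integer>> index = new HashMap<>();
        add(index, 1, dutchAnalyze("huizen branden"));    // Dutch text
        add(index, 2, defaultAnalyze("a strong brand"));  // English text
        return index;
    }

    // Runs the query through every analyzer and ORs the resulting terms.
    static Set<Integer> searchOrAllAnalyzers(Map<String, Set<Integer>> index, String query) {
        List<String> terms = new ArrayList<>(dutchAnalyze(query));
        terms.addAll(defaultAnalyze(query));
        Set<Integer> hits = new TreeSet<>();
        for (String term : terms) {
            hits.addAll(index.getOrDefault(term, Collections.emptySet()));
        }
        return hits;
    }

    public static void main(String[] args) {
        // "branden" stems to "brand", which also matches English document 2.
        System.out.println(searchOrAllAnalyzers(buildIndex(), "branden")); // prints [1, 2]
    }
}
```

Document 2 is the unwanted hit: it contains the English word "brand", which only matches because the Dutch stem of the query happens to be the same string.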

> 
> > Example of the SynonymProvider mentioned at the top:
> >
> > If my suggested changes are accepted, things like a SynonymProvider
> > become superfluous, and very easy to add on the fly:
> >
> > suppose I always want full searching with Dutch synonyms on the "body"
> > property of my nodes. This boils down to adding an analyzer for this
> > property that extends the DutchAnalyzer in Lucene and adds synonym
> > functionality (very simple example in the "Lucene in Action" book). I
> > think it is better to do synonyms during analyzing (as opposed to the
> > SynonymProvider in jr trunk), and simply use an analyzer for it. Of
> > course, a difference in using it would be that with the current
> > SynonymProvider you specifically have to define that you do a synonym
> > search (~term), while with an analyzer you define which properties
> > should be indexed with a synonym analyzer, and searched accordingly
> > (without having to specify it),
> 
> well, those are actually the reasons why I implemented it the other way.
> If you go the analyzer way to expand synonyms you have to re-index the
> complete content if you want to add a single synonym.

No, IIUC this is not the case: adding a synonym will directly result in an extra OR term in the
query (though this is not really important for the issue at hand).
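
A toy sketch of that reading, assuming the synonym expansion is applied to the query string rather than baked into the indexed terms. It is plain Java, not the Lucene or Jackrabbit API; the class, its methods, and the "fiets"/"rijwiel" pair are illustrative assumptions. If synonyms were instead injected into the index at indexing time, existing documents would not pick up a new synonym without re-indexing.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class QueryTimeSynonymSketch {

    // Hypothetical synonym table that can be changed at any time.
    private final Map<String, Set<String>> synonyms = new HashMap<>();

    void addSynonym(String term, String synonym) {
        synonyms.computeIfAbsent(term, k -> new LinkedHashSet<>()).add(synonym);
    }

    // Expands a single query term into the terms to be OR-ed together.
    List<String> expand(String term) {
        List<String> orTerms = new ArrayList<>();
        orTerms.add(term);
        orTerms.addAll(synonyms.getOrDefault(term, Collections.emptySet()));
        return orTerms;
    }

    // Returns every document that contains at least one of the terms.
    static Set<Integer> search(Map<String, Set<Integer>> index, List<String> orTerms) {
        Set<Integer> hits = new TreeSet<>();
        for (String term : orTerms) {
            hits.addAll(index.getOrDefault(term, Collections.emptySet()));
        }
        return hits;
    }

    // The index is built once and never modified afterwards.
    static Map<String, Set<Integer>> buildIndex() {
        Map<String, Set<Integer>> index = new HashMap<>();
        index.put("fiets", new TreeSet<>(Collections.singleton(1)));
        index.put("rijwiel", new TreeSet<>(Collections.singleton(2)));
        return index;
    }

    public static void main(String[] args) {
        Map<String, Set<Integer>> index = buildIndex();
        QueryTimeSynonymSketch query = new QueryTimeSynonymSketch();
        System.out.println(search(index, query.expand("fiets"))); // prints [1]
        query.addSynonym("fiets", "rijwiel");                     // no re-indexing
        System.out.println(search(index, query.expand("fiets"))); // prints [1, 2]
    }
}
```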

> I also wanted the user to decide if synonyms should be considered. Again
> this would not be possible if the analyzer automatically adds synonyms.
>
> but fortunately, with jackrabbit both is possible ;) if one prefers to
> expand terms on index time, just use an appropriate analyzer and don't
> configure a SynonymProvider.

Different horses for different courses, I understand your reasoning. 

Regards Ard
