jackrabbit-dev mailing list archives

From "Ard Schrijvers" <a.schrijv...@hippo.nl>
Subject RE: IndexingConfiguration jr 1.4 release, analyzing, searching and synonymprovider
Date Wed, 22 Aug 2007 10:21:45 GMT
Apologies for not indenting Bertrand's text properly, I am using webmail. See my comments below
the second ------ :

>Bertrand Delacretaz wrote:
----------------------------------
"Yes, given that many Lucene TokenFilters are available, this is useful I think.

I see two potential issues that you might want to take into account:

1) With configurable indexing analyzers, people sometimes have a hard
time figuring out how exactly their data is indexed (and why they
don't find it later).

Solr provides an analysis test page for that (see "Solr's content
analysis test page" in [1]). In the case of Jackrabbit, maybe logging
the filtered values of fields at the DEBUG level would help.

2) As discussed previously, one problem with this is which analyzer to
use when running a query that applies to several fields. In Solr, you
can configure a different analyzer for querying, it's probably the
best solution.

People then have to make sure their config is consistent for indexing
and querying, and might need in some cases to provide their own custom
QueryAnalyzer to achieve this. For example one that provides fake
synonyms for a token, with each synonym being the result of one of
the analysis methods used. This can get tricky depending on the
configured analysis, when searching in multiple fields.

See also http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
for more info on how Solr manages the analyzers."
---------------------------------

I think neither of these two problems applies to the solution I am aiming for: I'll add one
general analyzer to Jackrabbit, which looks something like:

import java.io.Reader;
import java.util.Map;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

class JRAnalyzerImpl extends Analyzer {
    // maps a field name to its configured analyzer
    private final Map configuredProperties;
    private final Analyzer defaultAnalyzer = new StandardAnalyzer();

    JRAnalyzerImpl(Map configuredProperties) {
        this.configuredProperties = configuredProperties;
    }

    public TokenStream tokenStream(String fieldName, Reader reader) {
        Analyzer analyzer = (Analyzer) configuredProperties.get(fieldName);
        if (analyzer != null) {
            return analyzer.tokenStream(fieldName, reader);
        }
        return this.defaultAnalyzer.tokenStream(fieldName, reader);
    }
}

Now, all I need to do is hold a map, configuredProperties, from field name to the configured
analyzer. When running a query over different fields I use the JRAnalyzerImpl as always, but
because it returns a different TokenStream per field, I implicitly use a different analyzer
for each field that has one configured. Since this one analyzer is used for indexing *and*
querying, on a per-field basis, it will always work.
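To make the consistency argument concrete without pulling in the Lucene classes, here is a
minimal stdlib-only sketch of the same per-field dispatch pattern; the SimpleAnalyzer
interface, the field names, and the lowercasing analyzer are invented for the illustration,
but it shows why index-time and query-time tokenization cannot drift apart when both sides go
through one shared map:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PerFieldDispatchSketch {
    // stand-in for Lucene's Analyzer: turns a piece of text into tokens
    interface SimpleAnalyzer {
        List<String> tokenize(String text);
    }

    // default behaviour: split on whitespace, keep case
    static final SimpleAnalyzer DEFAULT = new SimpleAnalyzer() {
        public List<String> tokenize(String text) {
            return Arrays.asList(text.split("\\s+"));
        }
    };

    // a per-field override: lowercase before splitting
    static final SimpleAnalyzer LOWERCASING = new SimpleAnalyzer() {
        public List<String> tokenize(String text) {
            return Arrays.asList(text.toLowerCase().split("\\s+"));
        }
    };

    // one shared map, consulted at index time AND query time
    static final Map<String, SimpleAnalyzer> CONFIGURED =
            new HashMap<String, SimpleAnalyzer>();
    static {
        CONFIGURED.put("title", LOWERCASING);
    }

    // single entry point used by both the indexer and the query parser
    static List<String> analyze(String fieldName, String text) {
        SimpleAnalyzer a = CONFIGURED.get(fieldName);
        return (a != null ? a : DEFAULT).tokenize(text);
    }

    public static void main(String[] args) {
        // index time and query time use the same dispatch, so tokens match
        List<String> indexed = analyze("title", "Jackrabbit Rocks");
        List<String> queried = analyze("title", "JACKRABBIT rocks");
        System.out.println(indexed.equals(queried));  // prints "true"
    }
}
```

Because analyze() is the single entry point for both sides, a field either gets its
configured analyzer in both places or the default in both places; there is no separate
query-time configuration to keep in sync.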

Might this be a better solution for Solr querying as well? It seems overcomplicated to me
that people have to take care of choosing an appropriate analyzer for querying themselves,
when that does not seem necessary. Not finding a hit where you would expect one can be pretty
hard to debug, certainly if you don't know where to look or don't understand Lucene analysis
to some extent. WDYT?

Regards Ard



[1] http://www.xml.com/lpt/a/1668



