lucene-solr-user mailing list archives

From Bjørn Hjelle <bjorn.hje...@gmail.com>
Subject (Edge)NGramFilterFactory and highlight
Date Fri, 19 Dec 2014 14:26:49 GMT
Hi,

Based on this example:
http://www.cominvent.com/2012/01/25/super-flexible-autocomplete-with-solr/
I have previously implemented highlighting of terms in
(Edge)NGram-analyzed fields successfully.

In a new project on Solr 4.10.2, however, it no longer works.

In the Solr admin analysis page I see the following in Solr 4.10.2 (simplified):

ENGTF  text   t  te  tes  test
       start  0  0   0    0
       end    4  4   4    4

But if I change to LUCENE_43 in solrconfig.xml, and reload the
analysis page I get this:

ENGTF  text   t  te  tes  test
       start  0  0   0    0
       end    1  2   3    4

So in 4.10.2 every gram gets the end offset of the whole token instead
of its own, and the highlighter therefore highlights the complete word
("test" in this case) rather than only the matched prefix.


To reproduce this:
1. Download Solr 4.10.2.
2. In the collection1 schema.xml, add this field type:


        <fieldType name="autocomplete_ngram" class="solr.TextField">
            <analyzer type="index">
                <charFilter class="solr.MappingCharFilterFactory"
                            mapping="mapping-ISOLatin1Accent.txt"/>
                <tokenizer class="solr.StandardTokenizerFactory"/>
                <filter class="solr.WordDelimiterFilterFactory"
                        generateWordParts="1" generateNumberParts="1" catenateWords="0"
                        catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
                <filter class="solr.LowerCaseFilterFactory"/>
                <filter class="solr.EdgeNGramFilterFactory"
                        maxGramSize="20" minGramSize="1"/>
                <filter class="solr.PatternReplaceFilterFactory"
                        pattern="([^\w\d\*æøåÆØÅ ])" replacement="" replace="all"/>
            </analyzer>
            <analyzer type="query">
                <charFilter class="solr.MappingCharFilterFactory"
                            mapping="mapping-ISOLatin1Accent.txt"/>
                <tokenizer class="solr.StandardTokenizerFactory"/>
                <filter class="solr.WordDelimiterFilterFactory"
                        generateWordParts="0" generateNumberParts="0" catenateWords="0"
                        catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
                <filter class="solr.LowerCaseFilterFactory"/>
                <filter class="solr.PatternReplaceFilterFactory"
                        pattern="([^\w\d\*æøåÆØÅ ])" replacement="" replace="all"/>
                <filter class="solr.PatternReplaceFilterFactory"
                        pattern="^(.{20})(.*)?" replacement="$1" replace="all"/>
            </analyzer>
        </fieldType>

3. Start Solr and, on the analysis page, enter "Test" in the
Field Value (Index) box and check the output.
4. Then set this in solrconfig.xml:

  <luceneMatchVersion>LUCENE_43</luceneMatchVersion>

5. Reload the core and reload the analysis page.
6. You will now see that the end offsets are correct.



Any ideas on how to make this work with Solr 4.10.2 without resorting
to changing luceneMatchVersion in solrconfig.xml?
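
One possible direction, sketched here as an untested assumption rather
than a confirmed fix: Solr analysis components can carry their own
luceneMatchVersion attribute, so the override could be limited to the
EdgeNGram filter in the index analyzer while the rest of the core stays
on the 4.10.2 default. The fragment shows only the changed filter;
everything else in the field type would stay as in step 2. Whether the
highlighter then behaves as before would still need to be checked on
the analysis page.

                <!-- Assumption: per-filter override, so only this filter
                     uses the 4.3 offset behaviour -->
                <filter class="solr.EdgeNGramFilterFactory"
                        luceneMatchVersion="LUCENE_43"
                        maxGramSize="20" minGramSize="1"/>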


Thanks,
Bjørn
