lucene-solr-user mailing list archives

From F Knudson <>
Subject Tokenizing and searching named character entity references
Date Thu, 24 Jul 2008 13:53:35 GMT


I am working with many different data sources. Some sources employ "entity
references"; others do not. My goal is to make searching across
sources as consistent as possible.

Example text - 

Source1:   weakening H&delta; absorption
Source1:   zero-field gap &omega;

Source2:  weakening H delta absorption
Source2:  zero-field gap omega

When I use the tokenizer solr.HTMLStripWhitespaceTokenizerFactory for Source1,
the entity reference is replaced with the character it names.

This works great.  

But I want the search tokens to be identical for each source. I need to
capture &delta; as a token.

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
       <tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/> 
       <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
ignoreCase="true" expand="true"/>
       <filter class="solr.ISOLatin1AccentFilterFactory"/>
       <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateA
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>

Is this possible with the tokenizers supplied with Solr? I experimented with
different combinations and orders and was not successful.

Is this possible using synonyms?  I also experimented with this route but
again was not successful.

Do I need to create a custom tokenizer?
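
One direction that might avoid a custom tokenizer is to rewrite the entity
references before tokenization. The fragment below is only a sketch. It assumes
a Solr release that supports <charFilter> elements and ships
solr.PatternReplaceCharFilterFactory, which may not be true of the version in
use here. The field type name "text_entities" is made up for illustration, and
the filter chain is cut down to the minimum.

```xml
<!-- Sketch only: assumes a Solr release with <charFilter> support.
     The name "text_entities" is hypothetical. -->
<fieldType name="text_entities" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Turn a named entity reference such as &delta; into " delta " before
         tokenizing, so that "H&delta;" yields the tokens "H" and "delta". -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="&amp;([A-Za-z]+);" replacement=" $1 "/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With a chain like this, "weakening H&delta; absorption" from Source1 and
"weakening H delta absorption" from Source2 should both produce the tokens
weakening, h, delta, absorption. The pattern is deliberately simple: it would
also turn &amp; into the token "amp", and it does not match entity names that
contain digits, so it would need tightening for real data.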
