lucene-solr-user mailing list archives

From F Knudson <fknud...@lanl.gov>
Subject Tokenizing and searching named character entity references
Date Thu, 24 Jul 2008 13:53:35 GMT

Greetings:

I am working with many different data sources. Some sources employ "entity
references"; others do not. My goal is to make searching across
sources as consistent as possible.

Example text:

Source1:   weakening H&delta; absorption
Source1:   zero-field gap &omega;

Source2:  weakening H delta absorption
Source2:  zero-field gap omega

Using the tokenizer solr.HTMLStripWhitespaceTokenizerFactory for Source1,
each entity reference is replaced with the character it names.

This works great.

But I want the search tokens to be identical for each source, so I need to
capture &delta; as a token.
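
As an illustration only, the desired normalization can be sketched outside Solr in Python. This is not Solr code and not a confirmed solution. It assumes a pre-processing step that runs before the text reaches the analyzer. The function names and the small set of entity names are hypothetical, and the whitespace split plus lowercasing is only a rough stand-in for the analyzer chain.

```python
import re

# Hypothetical, deliberately small set of entity names to spell out as words.
# A real list would have to cover every entity the sources actually use.
GREEK_ENTITY_NAMES = {"alpha", "beta", "gamma", "delta", "omega"}

# Matches named character entity references such as &delta; or &omega;
ENTITY_RE = re.compile(r"&([A-Za-z]+);")


def spell_out_entities(text):
    """Replace known entity references with the bare name as a separate word."""
    def _replace(match):
        name = match.group(1)
        if name.lower() in GREEK_ENTITY_NAMES:
            # Pad with spaces so "H&delta;" yields "H" and "delta", not "Hdelta".
            return " " + name + " "
        return match.group(0)  # leave unknown entities untouched
    return ENTITY_RE.sub(_replace, text)


def tokens(text):
    """Whitespace-split and lowercase: a rough stand-in for the analyzer chain.

    Note that lowercasing conflates &Delta; and &delta;.
    """
    return spell_out_entities(text).lower().split()


if __name__ == "__main__":
    print(tokens("weakening H&delta; absorption"))
    print(tokens("weakening H delta absorption"))
```

With this kind of rewrite applied to Source1, both example lines above produce the same token list, which is the consistency being asked for.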


<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1"
            catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
 
Is this possible with the Solr-supplied tokenizers?  I experimented with
different combinations and orders but was not successful.

Is this possible using synonyms?  I also experimented with this route but
again was not successful.

Do I need to create a custom tokenizer?

Thanks
Frances
-- 
View this message in context: http://www.nabble.com/Tokenizing-and-searching-named-character-entity-references-tp18632403p18632403.html
Sent from the Solr - User mailing list archive at Nabble.com.

