lucene-solr-user mailing list archives

From "G.S.J. Lobbestael" <>
Subject Re: Punctuation marks in documents prevent recognition of synonyms at indexing?
Date Sun, 27 Sep 2009 11:55:48 GMT
Thanks, this helps. 
But our synonym file has some 16,000 sets of synonyms.

Should the wiki warn users?
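- WhitespaceTokenizerFactory with synonyms at indexing will not expand a synonym that is
directly followed by a punctuation mark ("... synonym[punctuation mark] ..."), because the
punctuation mark stays attached to the token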

- the individual synonyms in your synonym file should be in a form as if they were sent through
the tokenizers which come before the SynonymFilterFactory
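
To illustrate the first point, here is a minimal sketch in plain Python (not Solr code). The function whitespace_tokens is only a rough stand-in for whitespace tokenization, and the sample sentence and synonym set are made up for the example:

```python
def whitespace_tokens(text):
    # Rough stand-in for whitespace tokenization: split on runs of
    # whitespace only, so punctuation stays glued to the word.
    return text.split()

# Hypothetical synonym entries, as they would appear in the synonym file.
synonyms = {"ADC", "HIV-dementie"}

tokens = whitespace_tokens("Patient has ADC, confirmed by MRI.")
matches = [t for t in tokens if t in synonyms]

print(tokens)   # the third token is "ADC," with the comma attached
print(matches)  # empty: "ADC," is not equal to the entry "ADC"
```

The token that reaches the synonym step is "ADC," rather than "ADC", so an exact lookup against the synonym entries finds nothing.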

With a WhitespaceTokenizerFactory:
Flaubert's Parrot, Julian Barnes
A History of the World in 10½ Chapters, Julian Barnes
England\, England, Julian Barnes
Arthur & George, Julian Barnes
Absalom\, Absalom!, William Faulkner
k-nearest neighbors algorithm, k-NN, k nn

With a StandardTokenizerFactory:
Flaubert's Parrot, Julian Barnes
A History of the World in 10 Chapters, Julian Barnes
England England, Julian Barnes
Arthur George, Julian Barnes
Absalom Absalom, William Faulkner
k nearest neighbors algorithm, k-NN, k nn, knn 

This means that when you change the TokenizerFactory you might also have to change your synonym
file. The change may be irreversible, though: you can't reconstruct the first version from the
second one.

Would it be possible for Solr to apply the Tokenizer in use while reading the synonym file?
Then the user would only need the original synonym file, and there could not be a conflict.
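
A rough sketch of what I mean, in plain Python rather than Solr's Java. Everything here is an assumption for illustration: split_entries, crude_tokenizer and normalize_line are hypothetical helpers, and crude_tokenizer (runs of word characters) is only a crude approximation, not the real behaviour of StandardTokenizerFactory:

```python
import re

def split_entries(line):
    # Split a synonym line on commas, treating "\," as a literal comma.
    parts = re.split(r"(?<!\\),", line)
    return [p.replace("\\,", ",").strip() for p in parts]

def crude_tokenizer(text):
    # Hypothetical stand-in for the tokenizer in use: keep runs of
    # word characters, drop punctuation. Not the real Solr tokenizer.
    return re.findall(r"\w+", text)

def normalize_line(line, tokenize):
    # Run every synonym entry through the same tokenizer the field uses,
    # then join the resulting tokens back into one entry.
    return ", ".join(" ".join(tokenize(e)) for e in split_entries(line))

original = [
    r"England\, England, Julian Barnes",
    r"Arthur & George, Julian Barnes",
    r"Absalom\, Absalom!, William Faulkner",
]
for line in original:
    print(normalize_line(line, crude_tokenizer))
```

With this stand-in the three lines come out as "England England, Julian Barnes", "Arthur George, Julian Barnes" and "Absalom Absalom, William Faulkner", which matches the second list above. The original file stays untouched, so switching tokenizers would only mean re-reading it.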

> > You lose the WordDelimiterFilterFactory functionality:
> > 
> > Syn.txt has: ADC, HIV-dementie
> > Search on "ADC" doesn't find document with "HIV-dementie".
> synonym filter can handle multi word synonyms. Replace Syn.txt to
> Syn.txt has: ADC, HIV dementie
> And search on "ADC" will find document with "HIV-dementie".
> hope this helps.
