lucene-solr-user mailing list archives

From david.dav...@correo.aeat.es
Subject Problems with gaps removed with SynonymFilter
Date Mon, 23 Sep 2013 06:45:27 GMT
Hi, 

I am having a problem combining StopFilterFactory and
SynonymFilterFactory. The problem is that SynonymFilter removes the position
gaps that StopFilterFactory previously left in the token stream. I'm applying
the filters at query time, because users need to change the synonym lists
frequently.

This is my schema, and an example of the issue:


String: "documentacion para agentes"

org.apache.solr.analysis.WhitespaceTokenizerFactory 
{luceneMatchVersion=LUCENE_35}
position        1       2       3
term text       documentación    para   agentes
startOffset     0       14      19
endOffset       13      18      26
org.apache.solr.analysis.LowerCaseFilterFactory 
{luceneMatchVersion=LUCENE_35}
position        1       2       3
term text       documentación    para   agentes
startOffset     0       14      19
endOffset       13      18      26
org.apache.solr.analysis.StopFilterFactory {words=stopwords_intranet.txt, 
ignoreCase=true, enablePositionIncrements=true, 
luceneMatchVersion=LUCENE_35}
position        1       3
term text       documentación   agentes
startOffset     0       19
endOffset       13      26
org.apache.solr.analysis.SynonymFilterFactory 
{synonyms=sinonimos_intranet.txt, expand=true, ignoreCase=true, 
luceneMatchVersion=LUCENE_35}
position        1               2
term text       documentación   agente
                archivo         agentes
type            SYNONYM         SYNONYM
                SYNONYM         SYNONYM
startOffset     0               19
                0               19
endOffset       13              26
                13              26


As you can see, the positions should be 1 and 3, but SynonymFilter removes
the gap and moves the token from position 3 to position 2.
I've got the same problem with Solr 3.5 and 4.0.
I don't know if it's a bug or an error in my configuration. In other
schemas that I have worked with, I had always put the SynonymFilter
before the StopFilter, but in this one I preferred this order because of
the large number of synonyms in the list (i.e. I don't want to generate
a lot of synonyms for a word that I really want to remove).
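
To make the expected and observed positions concrete, here is a toy model of
position increments. It is only a sketch and not Lucene's actual code: the
function names (stop_filter, collapse_gaps, positions) are made up for
illustration, and collapse_gaps simply mimics the symptom seen in the analysis
output above by clamping every increment to at most 1.

```python
from typing import List, Set, Tuple

# A token is modelled as (term, position increment), loosely following the
# idea behind Lucene's position increments. This is a simplification.
Token = Tuple[str, int]


def stop_filter(tokens: List[Token], stopwords: Set[str]) -> List[Token]:
    """Drop stopwords and add their increments to the next kept token,
    which is what enablePositionIncrements=true is expected to do."""
    out: List[Token] = []
    skipped = 0
    for term, inc in tokens:
        if term in stopwords:
            skipped += inc
        else:
            out.append((term, inc + skipped))
            skipped = 0
    return out


def collapse_gaps(tokens: List[Token]) -> List[Token]:
    """Hypothetical stand-in for the observed behaviour: any increment
    greater than 1 is reduced to 1, so the gap disappears."""
    return [(term, min(inc, 1)) for term, inc in tokens]


def positions(tokens: List[Token]) -> List[Tuple[str, int]]:
    """Turn increments into absolute positions."""
    pos = 0
    result = []
    for term, inc in tokens:
        pos += inc
        result.append((term, pos))
    return result


tokens = [("documentación", 1), ("para", 1), ("agentes", 1)]
after_stop = stop_filter(tokens, {"para"})

print(positions(after_stop))                 # expected: gap kept
print(positions(collapse_gaps(after_stop)))  # observed: gap removed
```

The first line printed corresponds to the positions I expect (1 and 3), the
second to what the analysis page shows after SynonymFilter (1 and 2).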

Thanks,

David Dávila Atienza
AEAT - Departamento de Informática Tributaria

Subdirección de Tecnologías de Análisis de la Información e Investigación 
del Fraude
Área de Infraestructuras
Teléfono: 915831543
Extensión: 31543