lucene-solr-user mailing list archives

From fredericbaroz <>
Subject Text analysis which expand the index with many words break subsequent analysis
Date Wed, 04 Mar 2015 19:18:47 GMT

My name is Frédéric Baroz. I work as an in-hospital physician in internal
medicine in Switzerland (I speak French) and as a software engineer. I work in
medical informatics and I am currently doing research on "semantic
search" for in-hospital physicians, who are confronted daily with searching
for medical information.

I am quite a newbie in Lucene/Solr and I have spent most of my time this last
year getting acquainted with this brilliant technology. In the context of my
work, I noticed that analysis, at index time or query time, sometimes needs to
expand the text by injecting more or less processed tokens one after the
other.

One common scenario is to have the system "prefer" exact word matches by
injecting into the index a stemmed version alongside the unmodified version
of each token. Other token filters behave similarly, such as
KeywordRepeatFilter, which injects two versions of each processed token, one of
which is flagged so that it skips the stemming phase. A last example is
AutoPhrasingTokenFilter, a contribution from Lucidworks that offers a
"workaround" for multi-term synonym matching (see

One problem with this approach, as I understand it, is that filters that adopt
this behavior break the analysis capabilities of subsequent filters. For
example, if we use KeywordRepeatFilter and then AutoPhrasingTokenFilter, the
latter will have no effect, because it *never sees* the token sequence it was
waiting for: one extra word has been added after each word by
KeywordRepeatFilter.
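
To make the failure concrete, here is a minimal sketch in plain Python, not
Lucene code. It assumes a toy model in which a token stream is a flat list and
each filter only sees that list in order. The functions keyword_repeat and
auto_phrase are hypothetical stand-ins for the behavior described above, not
the real KeywordRepeatFilter or AutoPhrasingTokenFilter APIs.

```python
from typing import List, Tuple

Token = Tuple[str, bool]  # (term, is_keyword); a toy stand-in for Lucene attributes


def keyword_repeat(tokens: List[Token]) -> List[Token]:
    """Emit every token twice, in place: a keyword-flagged copy, then a plain copy."""
    out: List[Token] = []
    for term, _ in tokens:
        out.append((term, True))   # flagged copy, meant to skip stemming later
        out.append((term, False))  # plain copy, meant to be stemmed later
    return out


def auto_phrase(tokens: List[Token], phrase: List[str]) -> List[Token]:
    """Merge each run of consecutive tokens equal to `phrase` into one token."""
    out: List[Token] = []
    i, n = 0, len(phrase)
    while i < len(tokens):
        window = [term for term, _ in tokens[i:i + n]]
        if window == phrase:
            out.append(("_".join(phrase), False))
            i += n
        else:
            out.append(tokens[i])
            i += 1
    return out


original = [(t, False) for t in ["acute", "congestive", "heart", "failure"]]
phrase = ["congestive", "heart", "failure"]

# Phrase detection on the untouched stream finds the phrase.
phrased_first = auto_phrase(original, phrase)
# Phrase detection after per-token duplication no longer finds it.
repeated_first = auto_phrase(keyword_repeat(original), phrase)

print([t for t, _ in phrased_first])
print([t for t, _ in repeated_first])
```

In this toy model the three-word phrase is merged only when the phrase step
runs on the untouched stream. A two-word phrase can still match by accident
after duplication, because the second copy of the first word sits next to the
first copy of the second word; a phrase of three or more words cannot.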

In my opinion, tokens "to be injected" should be injected all at once, after
the original token stream has been emitted, and not after each token seen by
the filter. This would preserve the ordered sequence of tokens, which, in my
opinion, carries important information.
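
The idea of deferring injection can be sketched as follows, again in plain
Python and not as Lucene code. The sketch assumes that each token carries the
position of its source word, so that tokens appended at the end can still be
tied back to the word they came from. Token, toy_stem, deferred_inject and
auto_phrase are all hypothetical names introduced for illustration, and
whether Lucene's indexing chain could consume such a stream is not addressed
here.

```python
from typing import Callable, List, NamedTuple


class Token(NamedTuple):
    term: str
    position: int   # position of the source word in the original text
    injected: bool  # True for tokens added by an expanding filter


def toy_stem(term: str) -> str:
    """Hypothetical stemmer for illustration only: strips a trailing 's'."""
    return term[:-1] if term.endswith("s") else term


def deferred_inject(tokens: List[Token], variant: Callable[[str], str]) -> List[Token]:
    """Pass the original tokens through unchanged, then append all variants at the end."""
    extras = [Token(variant(t.term), t.position, True) for t in tokens]
    return list(tokens) + extras


def auto_phrase(tokens: List[Token], phrase: List[str]) -> List[Token]:
    """Merge consecutive non-injected tokens equal to `phrase` into a single token."""
    out: List[Token] = []
    i, n = 0, len(phrase)
    while i < len(tokens):
        window = tokens[i:i + n]
        if [t.term for t in window] == phrase and not any(t.injected for t in window):
            out.append(Token("_".join(phrase), window[0].position, False))
            i += n
        else:
            out.append(tokens[i])
            i += 1
    return out


words = ["acute", "kidney", "stones"]
original = [Token(w, pos, False) for pos, w in enumerate(words)]

# The original sequence stays contiguous; stemmed variants follow it.
expanded = deferred_inject(original, toy_stem)
result = auto_phrase(expanded, ["kidney", "stones"])

for token in result:
    print(token)
```

Because the original tokens stay adjacent, the phrase step still sees
"kidney stones" in order, while the stemmed variants remain available at the
end of the stream with the positions of their source words.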

So my question is: has anyone already addressed this problem, and are there
any workarounds that one might have thought of?

And for the record, today Google is no friend to me ;)

Thanks in advance for your help,

Frédéric Baroz
