lucene-solr-user mailing list archives

From fredericbaroz <fredericba...@gmail.com>
Subject Text analysis which expand the index with many words break subsequent analysis
Date Wed, 04 Mar 2015 19:18:47 GMT
Hello,

My name is Frédéric Baroz. I work as an in-hospital physician in internal
medicine in Switzerland (I speak French) and as a software engineer. I work in
medical informatics and I am currently doing research on "semantic
search" for in-hospital physicians, who have to search for medical
information every day.

I am quite a newbie in Lucene/Solr and I have spent most of my time this last
year getting acquainted with this brilliant technology. In the context of my
work, I noticed that analysis, at index time or query time, sometimes needs to
expand the text by injecting more or less processed tokens one after the
other.

One common scenario is to have the system "prefer" exact word matches by
injecting into the index a stemmed version along with the unmodified version
of a token. Other token filters have a similar behavior, like
KeywordRepeatFilter, which injects two versions of each processed token, one
of which is flagged in order to skip the stemming phase. A last example is
AutoPhrasingTokenFilter, a contribution from Lucidworks which offers a
"workaround" for multi-term synonym matching (see
http://lucidworks.com/blog/automatic-phrase-tokenization-improving-lucene-search-precision-by-more-precise-linguistic-analysis/)
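
As a rough illustration of the "repeat, then stem" pattern described above, here is a minimal pure-Python sketch. It is not Lucene code: `Token`, `keyword_repeat`, `toy_stem` and `analyze` are hypothetical names, and the "stemmer" merely drops one trailing "s". It only models the idea that each token is emitted twice, once flagged as a keyword and once stemmable at the same position.

```python
from typing import Iterable, Iterator, List, NamedTuple


class Token(NamedTuple):
    term: str
    pos_inc: int = 1        # position increment relative to the previous token
    keyword: bool = False   # True means "do not stem this token"


def keyword_repeat(tokens: Iterable[Token]) -> Iterator[Token]:
    """Emit each token twice: a keyword-flagged copy, then a stemmable copy
    stacked at the same position (position increment 0)."""
    for tok in tokens:
        yield tok._replace(keyword=True)
        yield tok._replace(pos_inc=0, keyword=False)


def toy_stem(tokens: Iterable[Token]) -> Iterator[Token]:
    """Hypothetical stemmer: drop one trailing 's' from non-keyword tokens."""
    for tok in tokens:
        if not tok.keyword and len(tok.term) > 3 and tok.term.endswith("s"):
            yield tok._replace(term=tok.term[:-1])
        else:
            yield tok


def analyze(text: str) -> List[str]:
    """Whitespace-tokenize, repeat every token, then stem the unflagged copies."""
    stream = (Token(t) for t in text.lower().split())
    return [tok.term for tok in toy_stem(keyword_repeat(stream))]


print(analyze("heart attacks"))
```

Each input word yields two output tokens, so the stream seen by any later filter is twice as long as the original text.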

One problem with this approach, as I understand it, is that filters that adopt
this behavior break the analysis capabilities of subsequent filters. For
example, if we use KeywordRepeatFilter and then AutoPhrasingTokenFilter, the
latter will have no effect since it *never sees* the token sequence that it is
waiting for: KeywordRepeatFilter has added one extra word after each word.
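
The interaction can be reproduced with a toy model in plain Python. This is only a sketch under simplifying assumptions: `keyword_repeat` and `auto_phrase` are hypothetical stand-ins that work on bare strings, ignore positions and flags, and do not reflect how the real filters are implemented.

```python
from typing import List, Sequence, Tuple


def keyword_repeat(terms: Sequence[str]) -> List[str]:
    """Toy stand-in: emit every term twice, one copy right after the other."""
    out: List[str] = []
    for term in terms:
        out.extend([term, term])
    return out


def auto_phrase(terms: Sequence[str], phrases: Sequence[Tuple[str, ...]]) -> List[str]:
    """Toy stand-in: merge consecutive terms that spell a known phrase into one token."""
    out: List[str] = []
    i = 0
    while i < len(terms):
        for phrase in phrases:
            n = len(phrase)
            if tuple(terms[i:i + n]) == phrase:
                out.append("_".join(phrase))
                i += n
                break
        else:
            # No phrase starts at this position: pass the term through unchanged.
            out.append(terms[i])
            i += 1
    return out


phrases = [("acute", "myocardial", "infarction")]
text = ["acute", "myocardial", "infarction"]

print(auto_phrase(text, phrases))                  # phrase filter sees the original order
print(auto_phrase(keyword_repeat(text), phrases))  # phrase filter sees the doubled stream
```

On the original sequence the three terms are merged into a single token. On the doubled stream no window of three consecutive terms spells the phrase, so nothing is merged. Note that in this toy model a two-term phrase could still match by accident, because the second copy of the first term sits next to the first copy of the second term.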

In my opinion, tokens "to be injected" should be injected all at once, after
the original token stream has been emitted, and not after each token seen by
the filter. This would preserve the ordered sequence of tokens, which, in my
opinion, carries important information.
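
A minimal sketch of this "inject at the end" idea, again in plain Python rather than Lucene: `inject_deferred`, `toy_variants` and `contains_sequence` are hypothetical helpers, and the model ignores positions and offsets, which a real filter would presumably still have to carry for the appended tokens.

```python
from typing import Callable, List, Sequence


def toy_variants(term: str) -> List[str]:
    """Hypothetical expansion: a crude 'stem' that drops one trailing 's'."""
    return [term[:-1]] if len(term) > 3 and term.endswith("s") else []


def inject_deferred(terms: Sequence[str],
                    expand: Callable[[str], List[str]]) -> List[str]:
    """Emit the original terms untouched, then append all extra variants at the end."""
    originals = list(terms)
    extras = [variant for term in originals for variant in expand(term)]
    return originals + extras


def contains_sequence(terms: Sequence[str], phrase: Sequence[str]) -> bool:
    """True if `phrase` occurs as a run of consecutive terms."""
    n = len(phrase)
    return any(list(terms[i:i + n]) == list(phrase)
               for i in range(len(terms) - n + 1))


stream = inject_deferred(["acute", "myocardial", "infarctions"], toy_variants)
print(stream)
print(contains_sequence(stream, ["acute", "myocardial", "infarctions"]))
```

Because the extra tokens are appended after the last original token, a downstream filter that looks for consecutive terms still finds the original sequence intact.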

So my question is: has anyone already addressed this problem, and are there
any workarounds that one might have thought of?

And for the record, today Google is no friend to me ;)

Thanks in advance for your help,

Frédéric Baroz



--
View this message in context: http://lucene.472066.n3.nabble.com/Text-analysis-which-expand-the-index-with-many-words-break-subsequent-analysis-tp4191001.html
Sent from the Solr - User mailing list archive at Nabble.com.
