lucene-solr-user mailing list archives

From "Chris Morley" <>
Subject re: Implementing custom analyzer for multi-language stemming
Date Wed, 30 Jul 2014 17:52:20 GMT
I know of a plugin for Elasticsearch that extends
stemming/lemmatization to work across 40 natural languages.
I'm not sure what its provider has for Solr, but I think something like
that may exist as well.


 From: "Eugene" <>
Sent: Wednesday, July 30, 2014 1:48 PM
Subject: Implementing custom analyzer for multi-language stemming

Hello, fellow Solr and Lucene users and developers!

In our project we receive text from users in different languages. We
detect the language automatically and use the Google Translate APIs a lot
(so having an arbitrary number of languages in our system doesn't concern
us). However, we need to be able to search using stemming. Listing nearly
a hundred fields (several fields for each language, with language-specific
stemmers) in our search query is not an option. So we need a way to have
a single index that holds stemmed tokens for different languages. I have
two questions:

1. Are there already any (third-party) custom multi-language stemming
analyzers? (I doubt we are the first to run into this issue.)

2. If I'm going to implement such an analyzer myself, could you please
suggest a good way to 'pass' the detected language value into it?
Detecting the language in the analyzer itself is not an option, because:
a) we already detect it elsewhere; b) we do it based on the combined
values of many fields ('name', 'topic', 'description', etc.), while the
current field may be too short for reliable detection; c) sometimes we
just want to specify the language explicitly. The obvious hack would be
to prepend the ISO 639-1 code to the field value. But I'd like to believe
that Solr allows for a cleaner solution. I could think of either: a) a
custom query parameter (but I guess it would require modifying request
handlers, etc., which is highly undesirable) or b) getting the value from
another field (we obviously have a 'language' field, and we do not have
mixed-language records). If it is possible, could you please describe the
mechanism for doing this or point to relevant code examples?
Thank you very much and have a good day!
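
To make the "obvious hack" from question 2 concrete, here is a minimal sketch in plain Python. It is not Solr or Lucene code. The `[xx]` marker format, the function names `tag` and `analyze`, and the toy suffix-stripping stemmers are all assumptions made up for illustration; a real setup would plug in proper stemmers and do the dispatch inside a custom analyzer.

```python
import re

# Hypothetical toy stemmers that only strip a few suffixes.
# Real language-specific stemmers would be plugged in here instead.
def stem_en(token):
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def stem_de(token):
    for suffix in ("en", "er", "e"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

# Maps an ISO 639-1 code to the stemmer for that language.
STEMMERS = {"en": stem_en, "de": stem_de}

# Assumed marker format: "[xx] " at the very start of the field value.
MARKER = re.compile(r"^\[([a-z]{2})\]\s*")

def tag(lang, text):
    """Prepend the already-detected language code to a field value."""
    return f"[{lang}] {text}"

def analyze(value):
    """Strip the marker, then stem each token with the matching stemmer.

    Values without a marker, or with an unknown code, are tokenized
    and lowercased but left unstemmed.
    """
    match = MARKER.match(value)
    lang = match.group(1) if match else None
    text = value[match.end():] if match else value
    stem = STEMMERS.get(lang, lambda token: token)
    return [stem(token) for token in re.findall(r"\w+", text.lower())]

print(analyze(tag("en", "Searching indexed documents")))
print(analyze(tag("de", "Kinder spielen")))
```

With these toy stemmers the two calls print `['search', 'index', 'document']` and `['kind', 'spiel']`. Note that the query string would need the same marker at search time, so that query terms are stemmed with the same language as the indexed tokens.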
