lucene-java-user mailing list archives

From Karl Wettin <>
Subject Re: Reverse stemmer?
Date Fri, 09 Oct 2009 03:07:20 GMT
For the case where the text contains mixed languages there are
solutions that simultaneously use morphological rules of two or more
languages. Coveo search does this but I don't know what their solution
looks like. I suppose one way to do it would be to stem all tokens
with all algorithms and add the results as synonyms, both at index and
query time. Sounds a bit expensive though. If you have enough
resources to spend at index time you could probably end up with a more
optimized index by using dictionaries to back per-word language
identification and to power brute-force stemming (way slow) that's
compatible with algorithmic query-time stemmers (way speedier). I'm just guessing
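
A minimal sketch of the "stem all tokens with all algorithms and add the
results as synonyms" idea, in plain Java rather than against the Lucene API.
Everything named here is an assumption for illustration: MultiStemExpander,
stripSuffixes and the two toy suffix lists stand in for real per-language
stemmers such as the Snowball ones. In a real analysis chain the variants
would presumably be emitted at the same token position, the way synonyms
usually are.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.function.UnaryOperator;

class MultiStemExpander {

    // Toy suffix stripper: a hypothetical stand-in for a real stemmer.
    static String stripSuffixes(String token, String... suffixes) {
        for (String suffix : suffixes) {
            // Keep at least three characters so short words are left alone.
            if (token.endsWith(suffix) && token.length() - suffix.length() >= 3) {
                return token.substring(0, token.length() - suffix.length());
            }
        }
        return token;
    }

    // One "stemmer" per language code; the suffix lists are illustrative only.
    static final Map<String, UnaryOperator<String>> STEMMERS = new LinkedHashMap<>();
    static {
        STEMMERS.put("en", t -> stripSuffixes(t, "ing", "ed", "s"));
        STEMMERS.put("sv", t -> stripSuffixes(t, "arna", "erna", "ar", "er", "en"));
    }

    // Returns the lowercased token plus every distinct stem produced by any stemmer.
    static Set<String> expand(String token) {
        String lower = token.toLowerCase(Locale.ROOT);
        Set<String> variants = new LinkedHashSet<>();
        variants.add(lower);
        for (UnaryOperator<String> stemmer : STEMMERS.values()) {
            variants.add(stemmer.apply(lower));
        }
        return variants;
    }

    public static void main(String[] args) {
        for (String token : new String[] {"walking", "bilarna"}) {
            System.out.println(token + " -> " + expand(token));
        }
    }
}
```

Applying the same expansion at index and query time keeps the two sides
compatible. The cost is one extra term per stemmer that changes the token,
which is where the "a bit expensive" part comes from.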


On 8 Oct 2009, at 21:20, Jason Rutherglen wrote:

> Out of curiosity and perhaps for practical purposes, how does one
> handle mixed language documents?  I suppose one could extract the
> words of a particular language and place them in a lang specific field?
> Are there libraries to perform this (yet)?
> On Thu, Oct 8, 2009 at 6:32 AM, Christian Reuschling
> <> wrote:
>> Hi,
>> looking up the different terms with a common stem can be useful in different
>> scenarios, so I don't want to judge whether someone needs it or not.
>> E.g., if you have multilingual documents in your index, it is
>> straightforward to determine the language of each document in order to
>> choose the right stemmer. At least this holds for documents with a
>> homogeneous language.
>> Although this is true at indexing time, classifying the language of the
>> user query is not so trivial, and you have to do this in order to stem the
>> query terms for searching. One possibility would be to search for the stems
>> produced by all stemmers, but in this case you will get many wrong
>> search terms, and thus much noise in the result lists.
>> Another possibility is to offer all 'potential synonyms' of the query terms
>> to the user, who can choose whether they are right or not. In this case
>> you need exactly the lookup 'queryTerm->stem->terms with same stem'. This
>> can be much more precise; the drawbacks are of course the interaction
>> needed from the user and longer queries.
>> To realize this, someone could write a specific Analyzer that additionally
>> stores this relationship, e.g. in a database. I personally don't know of
>> any way to read this directly out of the Lucene index.
>> If someone has best practices or an idea of how processing multilingual
>> indices can be done better, I would appreciate reading / hearing
>> about it.
>> all best
>> Chris
>> On Tue, 6 Oct 2009 16:31:36 +0900
>> David Leangen <> wrote:
>>> Hello,
>>> I've been using Lucene in a very basic way for some time now, and I'm
>>> starting to take advantage of some of the linguistic capabilities only
>>> now.
>>> I am making use of the snowball analyzer for stemming, and it works
>>> very well.
>>> Question: is there any such thing as a "reverse stemmer"? In other
>>> words, given the stem of a word, is there any algorithm to find the
>>> original word? Or is this just fantasy? ;-)
>>> Now, I understand that there is a 1:n mapping of stems:words. I can
>>> deal with that.
>>> Thanks!
>>> =David
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail:
>>> For additional commands, e-mail:
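
The lookup 'queryTerm->stem->terms with same stem' discussed in the thread
could be sketched as a side table that is filled while indexing. This is
only an illustration under stated assumptions: StemLookup, record,
termsSharingStemWith and toyStem are hypothetical names, the in-memory map
stands in for the database mentioned above, and toyStem is a crude
placeholder for a real stemmer such as Snowball.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;
import java.util.function.UnaryOperator;

class StemLookup {
    private final UnaryOperator<String> stemmer;
    // stem -> all surface forms seen at index time that reduce to it (1:n)
    private final Map<String, SortedSet<String>> termsByStem = new HashMap<>();

    StemLookup(UnaryOperator<String> stemmer) {
        this.stemmer = stemmer;
    }

    // Call once per token while indexing, before the token is replaced by its stem.
    void record(String term) {
        termsByStem.computeIfAbsent(stemmer.apply(term), k -> new TreeSet<>()).add(term);
    }

    // queryTerm -> stem -> every recorded term sharing that stem.
    SortedSet<String> termsSharingStemWith(String queryTerm) {
        SortedSet<String> terms = termsByStem.get(stemmer.apply(queryTerm));
        return terms == null
                ? Collections.<String>emptySortedSet()
                : Collections.unmodifiableSortedSet(terms);
    }

    // Hypothetical toy stemmer: lowercases and strips one English suffix.
    static String toyStem(String term) {
        String t = term.toLowerCase(Locale.ROOT);
        for (String suffix : new String[] {"ing", "ed", "s"}) {
            if (t.endsWith(suffix) && t.length() - suffix.length() >= 3) {
                return t.substring(0, t.length() - suffix.length());
            }
        }
        return t;
    }

    public static void main(String[] args) {
        StemLookup lookup = new StemLookup(StemLookup::toyStem);
        for (String term : new String[] {"walks", "walked", "walking", "talks"}) {
            lookup.record(term);
        }
        System.out.println(lookup.termsSharingStemWith("walk")); // [walked, walking, walks]
    }
}
```

The table only knows the surface forms that actually occurred in the
indexed text, which is usually what a user wants to pick from anyway.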