lucene-java-user mailing list archives

From "Uwe Goetzke" <uwe.goet...@healy-hudson.com>
Subject AW: Reverse stemmer?
Date Fri, 09 Oct 2009 08:37:40 GMT
We use a statistical approach, so our search has little language-dependent context.

A simplified description:
Our data is indexed with a "normal" analyzer into a data index.
In a second step, we index all terms of the defined search fields into a separate terms index,
using a different analyzer that produces character-level bigrams.
When the user searches for a phrase, we expand each term of the phrase with the matching terms from
the terms index and use the expanded query to search the data index. The matching terms are found by
matching bigrams with a certain degree of tolerance.

So if we have the following data indexed:
aabcc
bbca
abbc
aacc
abcbbca

and the user searches for bbc, we would search for
(terms with boost factors)
abbc^4
bbca^3
abcbbca^2
aabcc^1 (matched based on a tolerance factor)
The interesting part is how to boost these expanded terms to get an understandable ordering
of the search results with respect to the phrase entered by the user.
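
The expansion step described above can be sketched roughly as follows. This is only an illustration in plain Python, not our production code and not a Lucene API: the function names (char_bigrams, expand_term), the min_overlap tolerance parameter, and the Dice-style score used as a boost are all assumptions made for the example. The boost values it produces therefore differ from the ^4 ... ^1 values listed above, although the same four terms are selected.

```python
def char_bigrams(term):
    """Return the set of character-level bigrams of a term."""
    return {term[i:i + 2] for i in range(len(term) - 1)}


def expand_term(query, indexed_terms, min_overlap=0.5):
    """Return (term, score) pairs for indexed terms that share enough bigrams with the query.

    min_overlap is a hypothetical tolerance: the minimum fraction of the
    query's bigrams that must also occur in the indexed term.
    """
    query_grams = char_bigrams(query)
    matches = []
    for term in indexed_terms:
        term_grams = char_bigrams(term)
        shared = len(query_grams & term_grams)
        # Skip terms below the tolerance (and queries too short to have bigrams).
        if not query_grams or shared / len(query_grams) < min_overlap:
            continue
        # Dice coefficient as an assumed stand-in for the boost factor.
        dice = 2 * shared / (len(query_grams) + len(term_grams))
        matches.append((term, dice))
    # Highest score first; ties are broken alphabetically for a stable order.
    return sorted(matches, key=lambda m: (-m[1], m[0]))


terms = ["aabcc", "bbca", "abbc", "aacc", "abcbbca"]
expanded = expand_term("bbc", terms)
print(" OR ".join(f"{t}^{s:.2f}" for t, s in expanded))
```

With this scoring, aacc shares no bigram with bbc and is dropped, while aabcc shares only one of the two query bigrams and is kept solely because of the tolerance. Choosing a score that gives users an ordering they find intuitive is the hard part.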

Regards

Uwe Goetzke
Healy Hudson

-----Original Message-----
From: Jason Rutherglen [mailto:jason.rutherglen@gmail.com]
Sent: Thursday, 8 October 2009 21:20
To: java-user@lucene.apache.org
Subject: Re: Reverse stemmer?

Out of curiosity, and perhaps for practical purposes, how does one
handle mixed-language documents?  I suppose one could extract the
words of a particular language and place them in a language-specific field?
Are there libraries to perform this (yet)?

On Thu, Oct 8, 2009 at 6:32 AM, Christian Reuschling
<christian.reuschling@gmail.com> wrote:
> Hi,
>
> looking up the different terms with a common stem can be useful in different
> scenarios - so I don't want to judge whether someone needs it or not.
>
> E.g., if you have multilingual documents in your index, it is straightforward
> to determine the language of each document in order to choose the right
> stemmer. At least this holds for documents in a single, homogeneous language.
>
> Although this is true at indexing time, language classification for the
> user query is not so trivial - and you have to do it in order to stem the
> query terms for searching. One possibility would be to search for the stems
> produced by all stemmers - but in this case you will get many wrong
> search terms, and thus much noise in the result lists.
>
> Another possibility is to offer all 'potential synonyms' of the query terms
> to the user, who can then choose whether they are right or not. In this case
> you need exactly the lookup 'queryTerm->stem->terms with same stem'. This can
> be much more precise; the drawbacks are of course the interaction needed from the
> user and longer queries.
>
> To realize this, someone could write a specific Analyzer that additionally stores this
> relationship, e.g. in a database. I personally don't know of any
> way to read this directly out of the Lucene index.
>
>
> If someone has best practices or an idea for how processing multilingual
> indices can be done better, I would appreciate reading / hearing about it.
>
>
>
> all best
>
> Chris
>
>
> On Tue, 6 Oct 2009 16:31:36 +0900
> David Leangen <apache@leangen.net> wrote:
>
>>
>> Hello,
>>
>> I've been using Lucene in a very basic way for some time now, and I'm
>> starting to take advantage of some of the linguistic capabilities only
>> now.
>>
>> I am making use of the snowball analyzer for stemming, and it works
>> very well.
>>
>>
>> Question: is there any such thing as a "reverse stemmer"? In other
>> words, given the stem of a word, is there any algorithm to find the
>> original word? Or is this just fantasy? ;-)
>>
>> Now, I understand that there is a 1:n mapping of stems:words. I can
>> deal with that.
>>
>>
>> Thanks!
>> =David
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>
>

-----------------------------------------------------------------------
Healy Hudson GmbH - D-55252 Mainz Kastel
Managing Director: Christian Konhäuser - Amtsgericht Wiesbaden HRB 12076

This email is confidential. If you are not the intended recipient, you must not disclose or
use the information contained in it. If you have received this email in error, please tell
us immediately by return email and delete the document.

