lucene-java-user mailing list archives

From: Benjamin Douglas <bbdoug...@basistech.com>
Subject: RE: Using org.apache.lucene.analysis.compound
Date: Wed, 21 Oct 2009 17:40:27 GMT

Thanks for all of the answers so far!

Paul's question is similar to another aspect I am curious about:

Given the way the sample word is analyzed, is there anything in the scoring mechanism that
would rank "überwachungsgesetz" higher than "gesetzüberwachung" or "fleischgesetz"?

-----Original Message-----
From: Robert Muir [mailto:rcmuir@gmail.com] 
Sent: Wednesday, October 21, 2009 5:12 AM
To: java-user@lucene.apache.org
Subject: Re: Using org.apache.lucene.analysis.compound

Paul, there are two implementations in compounds, one is dictionary-based,
the other is hyphenation-grammar + dictionary (it restricts the
decompounding based on hyphenation rules). You could also subclass the
compound base class and implement your own.

I haven't seen any user measures (relevance, etc.); they would be a cool
thing to see, though.

I'm not sure I understand your last question; can you elaborate?
To improve some cases, you might want to use the onlyLongestMatch
parameter:
@param onlyLongestMatch Add only the longest matching subword to the stream
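
To make the effect of onlyLongestMatch concrete, here is a minimal sketch
of dictionary-based decompounding in plain Java. It is not Lucene's code:
the class name DecompoundSketch, the method decompound, the tiny
dictionary, and the size limits 2 and 15 are all assumptions made for
illustration. It tries every start offset and every subword length, and
emits dictionary hits in the same "term, start, end, posIncr=0" shape as
the package example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Simplified stand-in for a dictionary-based decompounder; NOT Lucene's code.
final class DecompoundSketch {

    // Returns the original word followed by one line per dictionary subword,
    // each shaped like "term, start, end, posIncr=0".
    static List<String> decompound(String word, Set<String> dictionary,
                                   int minSubwordSize, int maxSubwordSize,
                                   boolean onlyLongestMatch) {
        List<String> tokens = new ArrayList<>();
        // The original, undecompounded word is always kept as the first token.
        tokens.add(word + ", 0, " + word.length());
        String lower = word.toLowerCase(Locale.ROOT);
        for (int start = 0; start + minSubwordSize <= lower.length(); start++) {
            String longest = null;
            int maxLen = Math.min(maxSubwordSize, lower.length() - start);
            for (int len = minSubwordSize; len <= maxLen; len++) {
                int end = start + len;
                if (!dictionary.contains(lower.substring(start, end))) {
                    continue;
                }
                String token = word.substring(start, end)
                        + ", " + start + ", " + end + ", posIncr=0";
                if (onlyLongestMatch) {
                    longest = token; // len only grows, so the last hit is the longest
                } else {
                    tokens.add(token);
                }
            }
            if (longest != null) {
                tokens.add(longest);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // "\u00fc" is the escape for u-umlaut, used to keep the source ASCII-only.
        Set<String> dict = new HashSet<>(Arrays.asList(
                "rind", "rindfleisch", "fleisch", "\u00fcberwachung", "gesetz"));
        String word = "Rindfleisch\u00fcberwachungsgesetz";
        decompound(word, dict, 2, 15, false).forEach(System.out::println);
        System.out.println("-- onlyLongestMatch --");
        decompound(word, dict, 2, 15, true).forEach(System.out::println);
    }
}
```

With this made-up dictionary, both "Rind" and "Rindfleisch" match at
offset 0. Without onlyLongestMatch both are emitted; with it, only
"Rindfleisch" survives for that offset, while the subwords that start at
other offsets are unaffected.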

For scoring, I think Lucene's scoring might help too: the original word,
without decompounding, is left as a token, so if you search on an exact
match it should be ranked higher. (Not sure if this answers your
question.)
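
As a rough illustration of that point, the sketch below counts how many
analyzed query terms also occur among the terms indexed for
Rindfleischüberwachungsgesetz. Term overlap is only a crude stand-in
chosen for this example; it is not Lucene's scoring formula, and the
class name OverlapSketch and the hand-written term sets are assumptions
(they presume a lowercased dictionary containing rind, fleisch,
überwachung and gesetz).

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Crude term-overlap count; a stand-in for illustration, NOT Lucene's scoring.
final class OverlapSketch {

    // Number of distinct query terms that also occur among the document's terms.
    static int overlap(Set<String> docTerms, Set<String> queryTerms) {
        Set<String> shared = new HashSet<>(queryTerms);
        shared.retainAll(docTerms);
        return shared.size();
    }

    static Set<String> terms(String... values) {
        return new HashSet<>(Arrays.asList(values));
    }

    public static void main(String[] args) {
        // "\u00fc" is the escape for u-umlaut, used to keep the source ASCII-only.
        // Hand-written terms for the indexed compound: the original word plus subwords.
        Set<String> doc = terms("rindfleisch\u00fcberwachungsgesetz",
                "rind", "fleisch", "\u00fcberwachung", "gesetz");

        // Each query is shown as it would look after the same decompounding.
        Set<String> exact = doc;
        Set<String> rindfleisch = terms("rindfleisch", "rind", "fleisch");
        Set<String> ueberwachungsgesetz =
                terms("\u00fcberwachungsgesetz", "\u00fcberwachung", "gesetz");
        Set<String> gesetzueberwachung =
                terms("gesetz\u00fcberwachung", "gesetz", "\u00fcberwachung");

        System.out.println(overlap(doc, exact));               // 5
        System.out.println(overlap(doc, rindfleisch));         // 2
        System.out.println(overlap(doc, ueberwachungsgesetz)); // 2
        System.out.println(overlap(doc, gesetzueberwachung));  // 2
    }
}
```

Under these assumptions the exact query shares every term with the
document, including the undecompounded original, so it stands out. The
partial queries match only through their subwords, and because the
subwords carry no order information, überwachungsgesetz and
gesetzüberwachung come out the same in this simplified count.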

On Wed, Oct 21, 2009 at 5:27 AM, Paul Libbrecht <paul@activemath.org> wrote:

>
> I'm interested in this analyzer. It had escaped me, and it solves an old
> problem!
> Could you report on its usage:
> - did you have to feed words into a dictionary?
> - does anyone have user measures already?
> ... and the last question, for the research fun: is there any approach
> towards preferring Überwachungsgesetz as a concept over, say,
> Fleischüberwachung? (Again, that could probably be based on a dictionary.)
>
> thanks in advance
>
> paul
>
>
> On 21 Oct 2009, at 04:00, Robert Muir wrote:
>
>
>> hi, it will work because it will also decompound "Rindfleisch" into Rind
>> and fleisch, with posIncr=0
>>
>> so if you index Rindfleischüberwachungsgesetz, then query with
>> "Rindfleisch", it matches because Rindfleisch also gets decompounded into
>> Rind and fleisch.
>>
>> On Tue, Oct 20, 2009 at 8:35 PM, Benjamin Douglas
>> <bbdouglas@basistech.com> wrote:
>>
>>> Hello,
>>>
>>> I've found a number of posts in different places talking about how to
>>> perform decompounding, but I haven't found too many discussing how to use
>>> the results of decompounding. If anyone can answer this question or point
>>> me to an existing discussion it would be very helpful.
>>>
>>> In the description of the org.apache.lucene.analysis.compound package, it
>>> gives the following example:
>>>
>>>      Rindfleischüberwachungsgesetz, 0, 29
>>>      Rind, 0, 4, posIncr=0
>>>      fleisch, 4, 11, posIncr=0
>>>      überwachung, 11, 22, posIncr=0
>>>      gesetz, 23, 29, posIncr=0
>>>
>>> And I see how this allows me to find single components such as "gesetz"
>>> or "Rind". But what if I want to find combinations of components such as
>>> "Rindfleisch" or "überwachungsgesetz"? It seems that the pattern of using
>>> posIncr=0 for all components excludes the possibility of finding
>>> sub-strings that are made up of multiple components.
>>>
>>> Any comments or thoughts would be appreciated.
>>>
>>> Ben Douglas
>>
>> --
>> Robert Muir
>> rcmuir@gmail.com
>>
>
>


-- 
Robert Muir
rcmuir@gmail.com