lucene-java-user mailing list archives

From: Paul Libbrecht <p...@activemath.org>
Subject: Re: Using org.apache.lucene.analysis.compound
Date: Wed, 21 Oct 2009 20:19:34 GMT
Great,

now the next question: which dictionary do you guys use? How big can it be?
Is 50000 words acceptable?

paul


On 21 Oct 2009, at 21:23, Robert Muir wrote:

> Paul, i think in general scoring should take care of this too; it's all about
> your dictionary, same as the previous example.
> this is because überwachungsgesetz matches 3 tokens: überwachungsgesetz,
> überwachung, gesetz, but überwachung gesetz only matches 2.
>
> überwachungsgesetz
> 0.37040412 = (MATCH) sum of:
>  0.10848885 = (MATCH) weight(field:überwachungsgesetz in 0), product of:
>    0.5 = queryWeight(field:überwachungsgesetz), product of:
>      0.30685282 = idf(docFreq=1, maxDocs=1)
>      1.6294457 = queryNorm
>    0.2169777 = (MATCH) fieldWeight(field:überwachungsgesetz in 0), product of:
>      1.4142135 = tf(termFreq(field:überwachungsgesetz)=2)
>      0.30685282 = idf(docFreq=1, maxDocs=1)
>      0.5 = fieldNorm(field=field, doc=0)
>  0.076713204 = (MATCH) weight(field:überwachung in 0), product of:
>    0.5 = queryWeight(field:überwachung), product of:
>      0.30685282 = idf(docFreq=1, maxDocs=1)
>      1.6294457 = queryNorm
>    0.15342641 = (MATCH) fieldWeight(field:überwachung in 0), product of:
>      1.0 = tf(termFreq(field:überwachung)=1)
>      0.30685282 = idf(docFreq=1, maxDocs=1)
>      0.5 = fieldNorm(field=field, doc=0)
>  0.10848885 = (MATCH) weight(field:überwachungsgesetz in 0), product of:
>    0.5 = queryWeight(field:überwachungsgesetz), product of:
>      0.30685282 = idf(docFreq=1, maxDocs=1)
>      1.6294457 = queryNorm
>    0.2169777 = (MATCH) fieldWeight(field:überwachungsgesetz in 0), product of:
>      1.4142135 = tf(termFreq(field:überwachungsgesetz)=2)
>      0.30685282 = idf(docFreq=1, maxDocs=1)
>      0.5 = fieldNorm(field=field, doc=0)
>  0.076713204 = (MATCH) weight(field:gesetz in 0), product of:
>    0.5 = queryWeight(field:gesetz), product of:
>      0.30685282 = idf(docFreq=1, maxDocs=1)
>      1.6294457 = queryNorm
>    0.15342641 = (MATCH) fieldWeight(field:gesetz in 0), product of:
>      1.0 = tf(termFreq(field:gesetz)=1)
>      0.30685282 = idf(docFreq=1, maxDocs=1)
>      0.5 = fieldNorm(field=field, doc=0)
>
> überwachung gesetz
> 0.30685282 = (MATCH) sum of:
>  0.15342641 = (MATCH) sum of:
>    0.076713204 = (MATCH) weight(field:überwachung in 0), product of:
>      0.5 = queryWeight(field:überwachung), product of:
>        0.30685282 = idf(docFreq=1, maxDocs=1)
>        1.6294457 = queryNorm
>      0.15342641 = (MATCH) fieldWeight(field:überwachung in 0), product of:
>        1.0 = tf(termFreq(field:überwachung)=1)
>        0.30685282 = idf(docFreq=1, maxDocs=1)
>        0.5 = fieldNorm(field=field, doc=0)
>    0.076713204 = (MATCH) weight(field:überwachung in 0), product of:
>      0.5 = queryWeight(field:überwachung), product of:
>        0.30685282 = idf(docFreq=1, maxDocs=1)
>        1.6294457 = queryNorm
>      0.15342641 = (MATCH) fieldWeight(field:überwachung in 0), product of:
>        1.0 = tf(termFreq(field:überwachung)=1)
>        0.30685282 = idf(docFreq=1, maxDocs=1)
>        0.5 = fieldNorm(field=field, doc=0)
>  0.15342641 = (MATCH) sum of:
>    0.076713204 = (MATCH) weight(field:gesetz in 0), product of:
>      0.5 = queryWeight(field:gesetz), product of:
>        0.30685282 = idf(docFreq=1, maxDocs=1)
>        1.6294457 = queryNorm
>      0.15342641 = (MATCH) fieldWeight(field:gesetz in 0), product of:
>        1.0 = tf(termFreq(field:gesetz)=1)
>        0.30685282 = idf(docFreq=1, maxDocs=1)
>        0.5 = fieldNorm(field=field, doc=0)
>    0.076713204 = (MATCH) weight(field:gesetz in 0), product of:
>      0.5 = queryWeight(field:gesetz), product of:
>        0.30685282 = idf(docFreq=1, maxDocs=1)
>        1.6294457 = queryNorm
>      0.15342641 = (MATCH) fieldWeight(field:gesetz in 0), product of:
>        1.0 = tf(termFreq(field:gesetz)=1)
>        0.30685282 = idf(docFreq=1, maxDocs=1)
>        0.5 = fieldNorm(field=field, doc=0)
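For reference, score breakdowns like the two above come straight from Lucene's explain facility. A minimal sketch, assuming an already-open IndexSearcher and a Query built with the same compound-aware analyzer used at index time (the helper name dumpExplanations is only illustrative):

    import java.io.IOException;
    import org.apache.lucene.search.Explanation;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;

    // Print the score explanation for each of the top hits of a query.
    static void dumpExplanations(IndexSearcher searcher, Query query) throws IOException {
        for (ScoreDoc sd : searcher.search(query, 10).scoreDocs) {
            Explanation expl = searcher.explain(query, sd.doc);
            System.out.println(expl.toString());
        }
    }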
>
> On Wed, Oct 21, 2009 at 3:16 PM, Paul Libbrecht <paul@activemath.org> wrote:
>
>> Can the dictionary have weights?
>>
>> überwachungsgesetz alone probably needs a higher rank than überwachung and
>> gesetz, no?
>>
>> paul
>>
>>
>> On 21 Oct 2009, at 21:09, Benjamin Douglas wrote:
>>
>>
>>> OK, that makes sense. So I just need to add all of the sub-compounds that
>>> are real words at posIncr=0, even if they are combinations of other
>>> sub-compounds.
>>>
>>> Thanks!
>>>
>>> -----Original Message-----
>>> From: Robert Muir [mailto:rcmuir@gmail.com]
>>> Sent: Wednesday, October 21, 2009 11:49 AM
>>> To: java-user@lucene.apache.org
>>> Subject: Re: Using org.apache.lucene.analysis.compound
>>>
>>> yes, your dictionary :)
>>>
>>> if überwachungsgesetz is a real word, add it to your dictionary.
>>>
>>> for example, if your dictionary is { "Rind", "Fleisch", "Draht", "Schere",
>>> "Gesetz", "Aufgabe", "Überwachung" }, and you index
>>> Rindfleischüberwachungsgesetz, then all 3 queries will have the same score.
>>> but if you expand the dictionary to { "Rind", "Fleisch", "Draht", "Schere",
>>> "Gesetz", "Aufgabe", "Überwachung", "Überwachungsgesetz" }, then this makes
>>> a big difference.
>>>
>>> all 3 queries will still match, but überwachungsgesetz will have a higher
>>> score. this is because now things are analyzed differently:
>>> Rindfleischüberwachungsgesetz will be decompounded as before, but with an
>>> additional token: Überwachungsgesetz.
>>> so back to your original question: these 'concatenations' of multiple
>>> components, yes, compounds will do that if they are real words, but it
>>> won't just make them up.
>>>
>>> "überwachungsgesetz"
>>> 0.23013961 = (MATCH) sum of:
>>> 0.057534903 = (MATCH) weight(field:überwachungsgesetz in 0), product of:
>>>  0.5 = queryWeight(field:überwachungsgesetz), product of:
>>>    0.30685282 = idf(docFreq=1, maxDocs=1)
>>>    1.6294457 = queryNorm
>>>  0.11506981 = (MATCH) fieldWeight(field:überwachungsgesetz in 0), product of:
>>>    1.0 = tf(termFreq(field:überwachungsgesetz)=1)
>>>    0.30685282 = idf(docFreq=1, maxDocs=1)
>>>    0.375 = fieldNorm(field=field, doc=0)
>>> 0.057534903 = (MATCH) weight(field:überwachung in 0), product of:
>>>  0.5 = queryWeight(field:überwachung), product of:
>>>    0.30685282 = idf(docFreq=1, maxDocs=1)
>>>    1.6294457 = queryNorm
>>>  0.11506981 = (MATCH) fieldWeight(field:überwachung in 0), product of:
>>>    1.0 = tf(termFreq(field:überwachung)=1)
>>>    0.30685282 = idf(docFreq=1, maxDocs=1)
>>>    0.375 = fieldNorm(field=field, doc=0)
>>> 0.057534903 = (MATCH) weight(field:überwachungsgesetz in 0), product of:
>>>  0.5 = queryWeight(field:überwachungsgesetz), product of:
>>>    0.30685282 = idf(docFreq=1, maxDocs=1)
>>>    1.6294457 = queryNorm
>>>  0.11506981 = (MATCH) fieldWeight(field:überwachungsgesetz in 0), product of:
>>>    1.0 = tf(termFreq(field:überwachungsgesetz)=1)
>>>    0.30685282 = idf(docFreq=1, maxDocs=1)
>>>    0.375 = fieldNorm(field=field, doc=0)
>>> 0.057534903 = (MATCH) weight(field:gesetz in 0), product of:
>>>  0.5 = queryWeight(field:gesetz), product of:
>>>    0.30685282 = idf(docFreq=1, maxDocs=1)
>>>    1.6294457 = queryNorm
>>>  0.11506981 = (MATCH) fieldWeight(field:gesetz in 0), product of:
>>>    1.0 = tf(termFreq(field:gesetz)=1)
>>>    0.30685282 = idf(docFreq=1, maxDocs=1)
>>>    0.375 = fieldNorm(field=field, doc=0)
>>>
>>> "gesetzüberwachung"
>>> 0.064782135 = (MATCH) sum of:
>>> 0.032391068 = (MATCH) weight(field:gesetz in 0), product of:
>>>  0.2814906 = queryWeight(field:gesetz), product of:
>>>    0.30685282 = idf(docFreq=1, maxDocs=1)
>>>    0.9173473 = queryNorm
>>>  0.11506981 = (MATCH) fieldWeight(field:gesetz in 0), product of:
>>>    1.0 = tf(termFreq(field:gesetz)=1)
>>>    0.30685282 = idf(docFreq=1, maxDocs=1)
>>>    0.375 = fieldNorm(field=field, doc=0)
>>> 0.032391068 = (MATCH) weight(field:überwachung in 0), product of:
>>>  0.2814906 = queryWeight(field:überwachung), product of:
>>>    0.30685282 = idf(docFreq=1, maxDocs=1)
>>>    0.9173473 = queryNorm
>>>  0.11506981 = (MATCH) fieldWeight(field:überwachung in 0), product of:
>>>    1.0 = tf(termFreq(field:überwachung)=1)
>>>    0.30685282 = idf(docFreq=1, maxDocs=1)
>>>    0.375 = fieldNorm(field=field, doc=0)
>>>
>>> "fleischgesetz"
>>> 0.064782135 = (MATCH) sum of:
>>> 0.032391068 = (MATCH) weight(field:fleisch in 0), product of:
>>>  0.2814906 = queryWeight(field:fleisch), product of:
>>>    0.30685282 = idf(docFreq=1, maxDocs=1)
>>>    0.9173473 = queryNorm
>>>  0.11506981 = (MATCH) fieldWeight(field:fleisch in 0), product of:
>>>    1.0 = tf(termFreq(field:fleisch)=1)
>>>    0.30685282 = idf(docFreq=1, maxDocs=1)
>>>    0.375 = fieldNorm(field=field, doc=0)
>>> 0.032391068 = (MATCH) weight(field:gesetz in 0), product of:
>>>  0.2814906 = queryWeight(field:gesetz), product of:
>>>    0.30685282 = idf(docFreq=1, maxDocs=1)
>>>    0.9173473 = queryNorm
>>>  0.11506981 = (MATCH) fieldWeight(field:gesetz in 0), product of:
>>>    1.0 = tf(termFreq(field:gesetz)=1)
>>>    0.30685282 = idf(docFreq=1, maxDocs=1)
>>>    0.375 = fieldNorm(field=field, doc=0)
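As a concrete illustration of the analysis chain described above, here is a minimal sketch assuming the Lucene 2.9-era DictionaryCompoundWordTokenFilter constructor that takes a String[] dictionary plus explicit size limits (exact signatures vary between Lucene versions, and the class name, dictionary casing and size values below are only illustrative). It prints each emitted token with its position increment, so the decompounded parts show up at posIncr=0:

    import java.io.StringReader;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.WhitespaceTokenizer;
    import org.apache.lucene.analysis.compound.DictionaryCompoundWordTokenFilter;
    import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
    import org.apache.lucene.analysis.tokenattributes.TermAttribute;

    public class CompoundDemo {
        public static void main(String[] args) throws Exception {
            // The expanded dictionary, lowercased so it matches the lowercased token stream.
            String[] dict = { "rind", "fleisch", "draht", "schere", "gesetz",
                              "aufgabe", "überwachung", "überwachungsgesetz" };

            TokenStream ts = new WhitespaceTokenizer(
                    new StringReader("Rindfleischüberwachungsgesetz"));
            ts = new LowerCaseFilter(ts);   // lowercase before the dictionary lookup
            ts = new DictionaryCompoundWordTokenFilter(
                    ts, dict,
                    5,      // minWordSize: tokens shorter than this are left alone
                    2,      // minSubwordSize
                    30,     // maxSubwordSize: raised so the 18-char "überwachungsgesetz" can match
                    false); // onlyLongestMatch = false: keep every matching subword

            TermAttribute term = ts.addAttribute(TermAttribute.class);
            PositionIncrementAttribute posIncr = ts.addAttribute(PositionIncrementAttribute.class);
            while (ts.incrementToken()) {
                // The original compound is emitted first; the matched parts follow at posIncr=0.
                System.out.println(term.term() + " (posIncr=" + posIncr.getPositionIncrement() + ")");
            }
            ts.close();
        }
    }

With that dictionary the compound should come apart into rind, fleisch, überwachung, überwachungsgesetz and gesetz, each at posIncr=0 behind the original token, which is what gives the überwachungsgesetz query its higher score.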
>>>
>>>
>>>
>>>
>>> On Wed, Oct 21, 2009 at 1:40 PM, Benjamin Douglas <bbdouglas@basistech.com> wrote:
>>>
>>>> Thanks for all of the answers so far!
>>>>
>>>> Paul's question is similar to another aspect I am curious about:
>>>>
>>>> Given the way the sample word is analyzed, is there anything in the scoring
>>>> mechanism that would rank "überwachungsgesetz" higher than
>>>> "gesetzüberwachung" or "fleischgesetz"?
>>>>
>>>>
>>>>
>>> --
>>> Robert Muir
>>> rcmuir@gmail.com
>>>
>>
>>
>
>
> -- 
> Robert Muir
> rcmuir@gmail.com

