lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Paul Libbrecht <p...@activemath.org>
Subject Re: Using org.apache.lucene.analysis.compound
Date Wed, 21 Oct 2009 19:16:12 GMT
Can the dictionary have weights?

überwachungsgesetz alone probably needs a higher rank than überwachung  
and gesetzt or?

paul


Le 21-oct.-09 à 21:09, Benjamin Douglas a écrit :

> OK, that makes sense. So I just need to add all of the sub-compounds  
> that are real words at posIncr=0, even if they are combinations of  
> other sub-compounds.
>
> Thanks!
>
> -----Original Message-----
> From: Robert Muir [mailto:rcmuir@gmail.com]
> Sent: Wednesday, October 21, 2009 11:49 AM
> To: java-user@lucene.apache.org
> Subject: Re: Using org.apache.lucene.analysis.compound
>
> yes, your dictionary :)
>
> if überwachungsgesetz is a real word, add it to your dictionary.
>
> for example, if your dictionary is { "Rind", "Fleisch", "Draht",  
> "Schere",
> "Gesetz", "Aufgabe", "Überwachung" }, and you index
> Rindfleischüberwachungsgesetz, then all 3 queries will have the same  
> score.
> but if you expand the dictionary to { "Rind", "Fleisch", "Draht",  
> "Schere",
> "Gesetz", "Aufgabe", "Überwachung", "Überwachungsgesetz" }, then  
> this makes
> a big difference.
>
> all 3 queries will still match, but überwachungsgesetz will have a  
> higher
> score. this is because now things are analyzed differently:
> Rindfleischüberwachungsgesetz will be decompounded as before, but  
> with an
> additional token: Überwachungsgesetz.
> so back to your original question, these 'concatenations' of multiple
> components, yes compounds will do that, if they are real words. but  
> it won't
> just make them up.
>
> "überwachungsgesetz"
> 0.23013961 = (MATCH) sum of:
>  0.057534903 = (MATCH) weight(field:überwachungsgesetz in 0),  
> product of:
>    0.5 = queryWeight(field:überwachungsgesetz), product of:
>      0.30685282 = idf(docFreq=1, maxDocs=1)
>      1.6294457 = queryNorm
>    0.11506981 = (MATCH) fieldWeight(field:überwachungsgesetz in 0),  
> product
> of:
>      1.0 = tf(termFreq(field:überwachungsgesetz)=1)
>      0.30685282 = idf(docFreq=1, maxDocs=1)
>      0.375 = fieldNorm(field=field, doc=0)
>  0.057534903 = (MATCH) weight(field:überwachung in 0), product of:
>    0.5 = queryWeight(field:überwachung), product of:
>      0.30685282 = idf(docFreq=1, maxDocs=1)
>      1.6294457 = queryNorm
>    0.11506981 = (MATCH) fieldWeight(field:überwachung in 0), product  
> of:
>      1.0 = tf(termFreq(field:überwachung)=1)
>      0.30685282 = idf(docFreq=1, maxDocs=1)
>      0.375 = fieldNorm(field=field, doc=0)
>  0.057534903 = (MATCH) weight(field:überwachungsgesetz in 0),  
> product of:
>    0.5 = queryWeight(field:überwachungsgesetz), product of:
>      0.30685282 = idf(docFreq=1, maxDocs=1)
>      1.6294457 = queryNorm
>    0.11506981 = (MATCH) fieldWeight(field:überwachungsgesetz in 0),  
> product
> of:
>      1.0 = tf(termFreq(field:überwachungsgesetz)=1)
>      0.30685282 = idf(docFreq=1, maxDocs=1)
>      0.375 = fieldNorm(field=field, doc=0)
>  0.057534903 = (MATCH) weight(field:gesetz in 0), product of:
>    0.5 = queryWeight(field:gesetz), product of:
>      0.30685282 = idf(docFreq=1, maxDocs=1)
>      1.6294457 = queryNorm
>    0.11506981 = (MATCH) fieldWeight(field:gesetz in 0), product of:
>      1.0 = tf(termFreq(field:gesetz)=1)
>      0.30685282 = idf(docFreq=1, maxDocs=1)
>      0.375 = fieldNorm(field=field, doc=0)
>
> "gesetzüberwachung"
> 0.064782135 = (MATCH) sum of:
>  0.032391068 = (MATCH) weight(field:gesetz in 0), product of:
>    0.2814906 = queryWeight(field:gesetz), product of:
>      0.30685282 = idf(docFreq=1, maxDocs=1)
>      0.9173473 = queryNorm
>    0.11506981 = (MATCH) fieldWeight(field:gesetz in 0), product of:
>      1.0 = tf(termFreq(field:gesetz)=1)
>      0.30685282 = idf(docFreq=1, maxDocs=1)
>      0.375 = fieldNorm(field=field, doc=0)
>  0.032391068 = (MATCH) weight(field:überwachung in 0), product of:
>    0.2814906 = queryWeight(field:überwachung), product of:
>      0.30685282 = idf(docFreq=1, maxDocs=1)
>      0.9173473 = queryNorm
>    0.11506981 = (MATCH) fieldWeight(field:überwachung in 0), product  
> of:
>      1.0 = tf(termFreq(field:überwachung)=1)
>      0.30685282 = idf(docFreq=1, maxDocs=1)
>      0.375 = fieldNorm(field=field, doc=0)
>
> "fleischgesetz"
> 0.064782135 = (MATCH) sum of:
>  0.032391068 = (MATCH) weight(field:fleisch in 0), product of:
>    0.2814906 = queryWeight(field:fleisch), product of:
>      0.30685282 = idf(docFreq=1, maxDocs=1)
>      0.9173473 = queryNorm
>    0.11506981 = (MATCH) fieldWeight(field:fleisch in 0), product of:
>      1.0 = tf(termFreq(field:fleisch)=1)
>      0.30685282 = idf(docFreq=1, maxDocs=1)
>      0.375 = fieldNorm(field=field, doc=0)
>  0.032391068 = (MATCH) weight(field:gesetz in 0), product of:
>    0.2814906 = queryWeight(field:gesetz), product of:
>      0.30685282 = idf(docFreq=1, maxDocs=1)
>      0.9173473 = queryNorm
>    0.11506981 = (MATCH) fieldWeight(field:gesetz in 0), product of:
>      1.0 = tf(termFreq(field:gesetz)=1)
>      0.30685282 = idf(docFreq=1, maxDocs=1)
>      0.375 = fieldNorm(field=field, doc=0)
>
>
>
>
> On Wed, Oct 21, 2009 at 1:40 PM, Benjamin Douglas
> <bbdouglas@basistech.com>wrote:
>
>> Thanks for all of the answers so far!
>>
>> Paul's question is similar to another aspect I am curious about:
>>
>> Given the way the sample word is analyzed, is there anything in the  
>> scoring
>> mechanism that would rank "überwachungsgesetz" higher than
>> "gesetzüberwachung" or "fleischgesetz"?
>>
>>
>
> -- 
> Robert Muir
> rcmuir@gmail.com


Mime
View raw message