lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Robert Muir <rcm...@gmail.com>
Subject Re: Using org.apache.lucene.analysis.compound
Date Wed, 21 Oct 2009 19:23:52 GMT
Paul, i think in general scoring should take care of this too, its all about
your dictionary, same as the previous example.
this is because überwachungsgesetz matches 3 tokens: überwachungsgesetz,
überwachung, gesetz
but überwachung gesetz only matches 2.

überwachungsgesetz
0.37040412 = (MATCH) sum of:
  0.10848885 = (MATCH) weight(field:überwachungsgesetz in 0), product of:
    0.5 = queryWeight(field:überwachungsgesetz), product of:
      0.30685282 = idf(docFreq=1, maxDocs=1)
      1.6294457 = queryNorm
    0.2169777 = (MATCH) fieldWeight(field:überwachungsgesetz in 0), product
of:
      1.4142135 = tf(termFreq(field:überwachungsgesetz)=2)
      0.30685282 = idf(docFreq=1, maxDocs=1)
      0.5 = fieldNorm(field=field, doc=0)
  0.076713204 = (MATCH) weight(field:überwachung in 0), product of:
    0.5 = queryWeight(field:überwachung), product of:
      0.30685282 = idf(docFreq=1, maxDocs=1)
      1.6294457 = queryNorm
    0.15342641 = (MATCH) fieldWeight(field:überwachung in 0), product of:
      1.0 = tf(termFreq(field:überwachung)=1)
      0.30685282 = idf(docFreq=1, maxDocs=1)
      0.5 = fieldNorm(field=field, doc=0)
  0.10848885 = (MATCH) weight(field:überwachungsgesetz in 0), product of:
    0.5 = queryWeight(field:überwachungsgesetz), product of:
      0.30685282 = idf(docFreq=1, maxDocs=1)
      1.6294457 = queryNorm
    0.2169777 = (MATCH) fieldWeight(field:überwachungsgesetz in 0), product
of:
      1.4142135 = tf(termFreq(field:überwachungsgesetz)=2)
      0.30685282 = idf(docFreq=1, maxDocs=1)
      0.5 = fieldNorm(field=field, doc=0)
  0.076713204 = (MATCH) weight(field:gesetz in 0), product of:
    0.5 = queryWeight(field:gesetz), product of:
      0.30685282 = idf(docFreq=1, maxDocs=1)
      1.6294457 = queryNorm
    0.15342641 = (MATCH) fieldWeight(field:gesetz in 0), product of:
      1.0 = tf(termFreq(field:gesetz)=1)
      0.30685282 = idf(docFreq=1, maxDocs=1)
      0.5 = fieldNorm(field=field, doc=0)

überwachung gesetz
0.30685282 = (MATCH) sum of:
  0.15342641 = (MATCH) sum of:
    0.076713204 = (MATCH) weight(field:überwachung in 0), product of:
      0.5 = queryWeight(field:überwachung), product of:
        0.30685282 = idf(docFreq=1, maxDocs=1)
        1.6294457 = queryNorm
      0.15342641 = (MATCH) fieldWeight(field:überwachung in 0), product of:
        1.0 = tf(termFreq(field:überwachung)=1)
        0.30685282 = idf(docFreq=1, maxDocs=1)
        0.5 = fieldNorm(field=field, doc=0)
    0.076713204 = (MATCH) weight(field:überwachung in 0), product of:
      0.5 = queryWeight(field:überwachung), product of:
        0.30685282 = idf(docFreq=1, maxDocs=1)
        1.6294457 = queryNorm
      0.15342641 = (MATCH) fieldWeight(field:überwachung in 0), product of:
        1.0 = tf(termFreq(field:überwachung)=1)
        0.30685282 = idf(docFreq=1, maxDocs=1)
        0.5 = fieldNorm(field=field, doc=0)
  0.15342641 = (MATCH) sum of:
    0.076713204 = (MATCH) weight(field:gesetz in 0), product of:
      0.5 = queryWeight(field:gesetz), product of:
        0.30685282 = idf(docFreq=1, maxDocs=1)
        1.6294457 = queryNorm
      0.15342641 = (MATCH) fieldWeight(field:gesetz in 0), product of:
        1.0 = tf(termFreq(field:gesetz)=1)
        0.30685282 = idf(docFreq=1, maxDocs=1)
        0.5 = fieldNorm(field=field, doc=0)
    0.076713204 = (MATCH) weight(field:gesetz in 0), product of:
      0.5 = queryWeight(field:gesetz), product of:
        0.30685282 = idf(docFreq=1, maxDocs=1)
        1.6294457 = queryNorm
      0.15342641 = (MATCH) fieldWeight(field:gesetz in 0), product of:
        1.0 = tf(termFreq(field:gesetz)=1)
        0.30685282 = idf(docFreq=1, maxDocs=1)
        0.5 = fieldNorm(field=field, doc=0)

On Wed, Oct 21, 2009 at 3:16 PM, Paul Libbrecht <paul@activemath.org> wrote:

> Can the dictionary have weights?
>
> überwachungsgesetz alone probably needs a higher rank than überwachung and
> gesetzt or?
>
> paul
>
>
> Le 21-oct.-09 à 21:09, Benjamin Douglas a écrit :
>
>
>  OK, that makes sense. So I just need to add all of the sub-compounds that
>> are real words at posIncr=0, even if they are combinations of other
>> sub-compounds.
>>
>> Thanks!
>>
>> -----Original Message-----
>> From: Robert Muir [mailto:rcmuir@gmail.com]
>> Sent: Wednesday, October 21, 2009 11:49 AM
>> To: java-user@lucene.apache.org
>> Subject: Re: Using org.apache.lucene.analysis.compound
>>
>> yes, your dictionary :)
>>
>> if überwachungsgesetz is a real word, add it to your dictionary.
>>
>> for example, if your dictionary is { "Rind", "Fleisch", "Draht", "Schere",
>> "Gesetz", "Aufgabe", "Überwachung" }, and you index
>> Rindfleischüberwachungsgesetz, then all 3 queries will have the same
>> score.
>> but if you expand the dictionary to { "Rind", "Fleisch", "Draht",
>> "Schere",
>> "Gesetz", "Aufgabe", "Überwachung", "Überwachungsgesetz" }, then this
>> makes
>> a big difference.
>>
>> all 3 queries will still match, but überwachungsgesetz will have a higher
>> score. this is because now things are analyzed differently:
>> Rindfleischüberwachungsgesetz will be decompounded as before, but with an
>> additional token: Überwachungsgesetz.
>> so back to your original question, these 'concatenations' of multiple
>> components, yes compounds will do that, if they are real words. but it
>> won't
>> just make them up.
>>
>> "überwachungsgesetz"
>> 0.23013961 = (MATCH) sum of:
>>  0.057534903 = (MATCH) weight(field:überwachungsgesetz in 0), product of:
>>   0.5 = queryWeight(field:überwachungsgesetz), product of:
>>     0.30685282 = idf(docFreq=1, maxDocs=1)
>>     1.6294457 = queryNorm
>>   0.11506981 = (MATCH) fieldWeight(field:überwachungsgesetz in 0), product
>> of:
>>     1.0 = tf(termFreq(field:überwachungsgesetz)=1)
>>     0.30685282 = idf(docFreq=1, maxDocs=1)
>>     0.375 = fieldNorm(field=field, doc=0)
>>  0.057534903 = (MATCH) weight(field:überwachung in 0), product of:
>>   0.5 = queryWeight(field:überwachung), product of:
>>     0.30685282 = idf(docFreq=1, maxDocs=1)
>>     1.6294457 = queryNorm
>>   0.11506981 = (MATCH) fieldWeight(field:überwachung in 0), product of:
>>     1.0 = tf(termFreq(field:überwachung)=1)
>>     0.30685282 = idf(docFreq=1, maxDocs=1)
>>     0.375 = fieldNorm(field=field, doc=0)
>>  0.057534903 = (MATCH) weight(field:überwachungsgesetz in 0), product of:
>>   0.5 = queryWeight(field:überwachungsgesetz), product of:
>>     0.30685282 = idf(docFreq=1, maxDocs=1)
>>     1.6294457 = queryNorm
>>   0.11506981 = (MATCH) fieldWeight(field:überwachungsgesetz in 0), product
>> of:
>>     1.0 = tf(termFreq(field:überwachungsgesetz)=1)
>>     0.30685282 = idf(docFreq=1, maxDocs=1)
>>     0.375 = fieldNorm(field=field, doc=0)
>>  0.057534903 = (MATCH) weight(field:gesetz in 0), product of:
>>   0.5 = queryWeight(field:gesetz), product of:
>>     0.30685282 = idf(docFreq=1, maxDocs=1)
>>     1.6294457 = queryNorm
>>   0.11506981 = (MATCH) fieldWeight(field:gesetz in 0), product of:
>>     1.0 = tf(termFreq(field:gesetz)=1)
>>     0.30685282 = idf(docFreq=1, maxDocs=1)
>>     0.375 = fieldNorm(field=field, doc=0)
>>
>> "gesetzüberwachung"
>> 0.064782135 = (MATCH) sum of:
>>  0.032391068 = (MATCH) weight(field:gesetz in 0), product of:
>>   0.2814906 = queryWeight(field:gesetz), product of:
>>     0.30685282 = idf(docFreq=1, maxDocs=1)
>>     0.9173473 = queryNorm
>>   0.11506981 = (MATCH) fieldWeight(field:gesetz in 0), product of:
>>     1.0 = tf(termFreq(field:gesetz)=1)
>>     0.30685282 = idf(docFreq=1, maxDocs=1)
>>     0.375 = fieldNorm(field=field, doc=0)
>>  0.032391068 = (MATCH) weight(field:überwachung in 0), product of:
>>   0.2814906 = queryWeight(field:überwachung), product of:
>>     0.30685282 = idf(docFreq=1, maxDocs=1)
>>     0.9173473 = queryNorm
>>   0.11506981 = (MATCH) fieldWeight(field:überwachung in 0), product of:
>>     1.0 = tf(termFreq(field:überwachung)=1)
>>     0.30685282 = idf(docFreq=1, maxDocs=1)
>>     0.375 = fieldNorm(field=field, doc=0)
>>
>> "fleischgesetz"
>> 0.064782135 = (MATCH) sum of:
>>  0.032391068 = (MATCH) weight(field:fleisch in 0), product of:
>>   0.2814906 = queryWeight(field:fleisch), product of:
>>     0.30685282 = idf(docFreq=1, maxDocs=1)
>>     0.9173473 = queryNorm
>>   0.11506981 = (MATCH) fieldWeight(field:fleisch in 0), product of:
>>     1.0 = tf(termFreq(field:fleisch)=1)
>>     0.30685282 = idf(docFreq=1, maxDocs=1)
>>     0.375 = fieldNorm(field=field, doc=0)
>>  0.032391068 = (MATCH) weight(field:gesetz in 0), product of:
>>   0.2814906 = queryWeight(field:gesetz), product of:
>>     0.30685282 = idf(docFreq=1, maxDocs=1)
>>     0.9173473 = queryNorm
>>   0.11506981 = (MATCH) fieldWeight(field:gesetz in 0), product of:
>>     1.0 = tf(termFreq(field:gesetz)=1)
>>     0.30685282 = idf(docFreq=1, maxDocs=1)
>>     0.375 = fieldNorm(field=field, doc=0)
>>
>>
>>
>>
>> On Wed, Oct 21, 2009 at 1:40 PM, Benjamin Douglas
>> <bbdouglas@basistech.com>wrote:
>>
>>  Thanks for all of the answers so far!
>>>
>>> Paul's question is similar to another aspect I am curious about:
>>>
>>> Given the way the sample word is analyzed, is there anything in the
>>> scoring
>>> mechanism that would rank "überwachungsgesetz" higher than
>>> "gesetzüberwachung" or "fleischgesetz"?
>>>
>>>
>>>
>> --
>> Robert Muir
>> rcmuir@gmail.com
>>
>
>


-- 
Robert Muir
rcmuir@gmail.com

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message