lucene-java-user mailing list archives

From Robert Muir <rcm...@gmail.com>
Subject Re: Using org.apache.lucene.analysis.compound
Date Wed, 21 Oct 2009 18:48:57 GMT
yes, your dictionary :)

if überwachungsgesetz is a real word, add it to your dictionary.

for example, if your dictionary is { "Rind", "Fleisch", "Draht", "Schere",
"Gesetz", "Aufgabe", "Überwachung" }, and you index
Rindfleischüberwachungsgesetz, then all 3 queries will have the same score.
but if you expand the dictionary to { "Rind", "Fleisch", "Draht", "Schere",
"Gesetz", "Aufgabe", "Überwachung", "Überwachungsgesetz" }, then this makes
a big difference.

all 3 queries will still match, but überwachungsgesetz will have a higher
score. this is because the text is now analyzed differently:
Rindfleischüberwachungsgesetz is decompounded as before, but with an
additional token: Überwachungsgesetz.
so, back to your original question about these 'concatenations' of multiple
components: yes, the compound filter will produce them if they are real words
in your dictionary, but it won't just make them up.
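
as a concrete sketch, here is roughly how that expanded dictionary could be
wired into an analyzer (this uses the 2.9-era contrib API; the analyzer class
and the tokenizer/filter chain below are just an illustration I'm assuming,
not necessarily the exact setup that produced the output further down):

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.compound.DictionaryCompoundWordTokenFilter;

// hypothetical analyzer, just to show where the dictionary goes
public class GermanCompoundAnalyzer extends Analyzer {
  // the expanded dictionary from the example above
  private static final String[] DICTIONARY = {
    "Rind", "Fleisch", "Draht", "Schere",
    "Gesetz", "Aufgabe", "Überwachung", "Überwachungsgesetz"
  };

  public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream stream = new WhitespaceTokenizer(reader);
    // emits each original token plus one token per dictionary word found inside it
    stream = new DictionaryCompoundWordTokenFilter(stream, DICTIONARY);
    // lowercase everything so indexed subwords and query terms line up
    return new LowerCaseFilter(stream);
  }
}

with something like that in place, indexing Rindfleischüberwachungsgesetz
yields the extra Überwachungsgesetz token, which is what gives the first query
below its higher score.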

"überwachungsgesetz"
0.23013961 = (MATCH) sum of:
  0.057534903 = (MATCH) weight(field:überwachungsgesetz in 0), product of:
    0.5 = queryWeight(field:überwachungsgesetz), product of:
      0.30685282 = idf(docFreq=1, maxDocs=1)
      1.6294457 = queryNorm
    0.11506981 = (MATCH) fieldWeight(field:überwachungsgesetz in 0), product of:
      1.0 = tf(termFreq(field:überwachungsgesetz)=1)
      0.30685282 = idf(docFreq=1, maxDocs=1)
      0.375 = fieldNorm(field=field, doc=0)
  0.057534903 = (MATCH) weight(field:überwachung in 0), product of:
    0.5 = queryWeight(field:überwachung), product of:
      0.30685282 = idf(docFreq=1, maxDocs=1)
      1.6294457 = queryNorm
    0.11506981 = (MATCH) fieldWeight(field:überwachung in 0), product of:
      1.0 = tf(termFreq(field:überwachung)=1)
      0.30685282 = idf(docFreq=1, maxDocs=1)
      0.375 = fieldNorm(field=field, doc=0)
  0.057534903 = (MATCH) weight(field:überwachungsgesetz in 0), product of:
    0.5 = queryWeight(field:überwachungsgesetz), product of:
      0.30685282 = idf(docFreq=1, maxDocs=1)
      1.6294457 = queryNorm
    0.11506981 = (MATCH) fieldWeight(field:überwachungsgesetz in 0), product of:
      1.0 = tf(termFreq(field:überwachungsgesetz)=1)
      0.30685282 = idf(docFreq=1, maxDocs=1)
      0.375 = fieldNorm(field=field, doc=0)
  0.057534903 = (MATCH) weight(field:gesetz in 0), product of:
    0.5 = queryWeight(field:gesetz), product of:
      0.30685282 = idf(docFreq=1, maxDocs=1)
      1.6294457 = queryNorm
    0.11506981 = (MATCH) fieldWeight(field:gesetz in 0), product of:
      1.0 = tf(termFreq(field:gesetz)=1)
      0.30685282 = idf(docFreq=1, maxDocs=1)
      0.375 = fieldNorm(field=field, doc=0)

"gesetzüberwachung"
0.064782135 = (MATCH) sum of:
  0.032391068 = (MATCH) weight(field:gesetz in 0), product of:
    0.2814906 = queryWeight(field:gesetz), product of:
      0.30685282 = idf(docFreq=1, maxDocs=1)
      0.9173473 = queryNorm
    0.11506981 = (MATCH) fieldWeight(field:gesetz in 0), product of:
      1.0 = tf(termFreq(field:gesetz)=1)
      0.30685282 = idf(docFreq=1, maxDocs=1)
      0.375 = fieldNorm(field=field, doc=0)
  0.032391068 = (MATCH) weight(field:überwachung in 0), product of:
    0.2814906 = queryWeight(field:überwachung), product of:
      0.30685282 = idf(docFreq=1, maxDocs=1)
      0.9173473 = queryNorm
    0.11506981 = (MATCH) fieldWeight(field:überwachung in 0), product of:
      1.0 = tf(termFreq(field:überwachung)=1)
      0.30685282 = idf(docFreq=1, maxDocs=1)
      0.375 = fieldNorm(field=field, doc=0)

"fleischgesetz"
0.064782135 = (MATCH) sum of:
  0.032391068 = (MATCH) weight(field:fleisch in 0), product of:
    0.2814906 = queryWeight(field:fleisch), product of:
      0.30685282 = idf(docFreq=1, maxDocs=1)
      0.9173473 = queryNorm
    0.11506981 = (MATCH) fieldWeight(field:fleisch in 0), product of:
      1.0 = tf(termFreq(field:fleisch)=1)
      0.30685282 = idf(docFreq=1, maxDocs=1)
      0.375 = fieldNorm(field=field, doc=0)
  0.032391068 = (MATCH) weight(field:gesetz in 0), product of:
    0.2814906 = queryWeight(field:gesetz), product of:
      0.30685282 = idf(docFreq=1, maxDocs=1)
      0.9173473 = queryNorm
    0.11506981 = (MATCH) fieldWeight(field:gesetz in 0), product of:
      1.0 = tf(termFreq(field:gesetz)=1)
      0.30685282 = idf(docFreq=1, maxDocs=1)
      0.375 = fieldNorm(field=field, doc=0)
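
(for completeness: explanations in this format are what Searcher.explain()
returns. a minimal sketch of how you could reproduce them, assuming you
already have an IndexSearcher over that single document plus the analyzer
from the earlier sketch; the class and helper names here are made up:)

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Explanation;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.Version;

public class ExplainDemo {
  // parse the query text with the compound-aware analyzer and print
  // lucene's score explanation against doc 0 (the only document here)
  static void printExplanation(IndexSearcher searcher, Analyzer analyzer, String text)
      throws Exception {
    Query query = new QueryParser(Version.LUCENE_CURRENT, "field", analyzer).parse(text);
    Explanation explanation = searcher.explain(query, 0);
    System.out.println("\"" + text + "\"");
    System.out.println(explanation.toString());
  }
}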




On Wed, Oct 21, 2009 at 1:40 PM, Benjamin Douglas
<bbdouglas@basistech.com> wrote:

> Thanks for all of the answers so far!
>
> Paul's question is similar to another aspect I am curious about:
>
> Given the way the sample word is analyzed, is there anything in the scoring
> mechanism that would rank "überwachungsgesetz" higher than
> "gesetzüberwachung" or "fleischgesetz"?
>
>

-- 
Robert Muir
rcmuir@gmail.com
