lucene-java-user mailing list archives

From Robert Muir <rcm...@gmail.com>
Subject Re: Using org.apache.lucene.analysis.compound
Date Wed, 21 Oct 2009 19:17:33 GMT
Just add them to the dictionary; the compound filter will then emit them
automatically.

If you want to tweak it even further, you can also tell the compound filter
NOT to emit the subwords when they form a bigger compound, using the
onlyLongestMatch parameter I spoke of earlier.
I haven't played with this option much, but I think this is what it's supposed
to do:

if the dictionary is
soft
ball
softball

then "softball" (or compounds containing it) won't emit "soft" and "ball",
because "softball" is in the dictionary and it's the longest match.
With the option off, you'd get softball, ball, soft.
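
To make the softball example concrete, here is a minimal, self-contained sketch of the behaviour described above. It is a toy model only, not Lucene's implementation: the class `CompoundSketch`, the helper `Match`, and the method `decompose` are hypothetical names made up for illustration, the dictionary is assumed to hold lowercase entries, and "longest match" is modelled as "drop any dictionary hit that sits inside a longer dictionary hit". The real filter may differ in details such as subword size limits, token order, and whether the original token is also emitted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy model of dictionary-based decompounding; NOT Lucene's actual code.
class CompoundSketch {

    // A dictionary hit: the matched subword and its [start, end) span in the token.
    static final class Match {
        final String word;
        final int start;
        final int end;

        Match(String word, int start, int end) {
            this.word = word;
            this.start = start;
            this.end = end;
        }

        int length() {
            return end - start;
        }
    }

    // Returns the dictionary entries found inside the token.
    // Assumption: dictionary entries are already lowercase.
    static List<String> decompose(String token, Set<String> dictionary, boolean onlyLongestMatch) {
        String lower = token.toLowerCase(Locale.ROOT);

        // Collect every substring of the token that is a dictionary entry.
        List<Match> matches = new ArrayList<>();
        for (int start = 0; start < lower.length(); start++) {
            for (int end = start + 1; end <= lower.length(); end++) {
                String candidate = lower.substring(start, end);
                if (dictionary.contains(candidate)) {
                    matches.add(new Match(candidate, start, end));
                }
            }
        }

        // With onlyLongestMatch, drop hits that lie inside a strictly longer hit.
        List<String> result = new ArrayList<>();
        for (Match m : matches) {
            boolean covered = false;
            if (onlyLongestMatch) {
                for (Match other : matches) {
                    boolean contains = other.start <= m.start && m.end <= other.end;
                    if (contains && other.length() > m.length()) {
                        covered = true;
                        break;
                    }
                }
            }
            if (!covered) {
                result.add(m.word);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Set<String> dictionary = Set.of("soft", "ball", "softball");
        System.out.println(decompose("softball", dictionary, true));   // [softball]
        System.out.println(decompose("softball", dictionary, false));  // [soft, softball, ball]
    }
}
```

The order of the subwords in the second result is just an artifact of how this sketch scans the token; only the set of emitted subwords is meant to mirror the description above.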

On Wed, Oct 21, 2009 at 3:09 PM, Benjamin Douglas
<bbdouglas@basistech.com> wrote:

> OK, that makes sense. So I just need to add all of the sub-compounds that
> are real words at posIncr=0, even if they are combinations of other
> sub-compounds.
>
> Thanks!
>
> -----Original Message-----
> From: Robert Muir [mailto:rcmuir@gmail.com]
> Sent: Wednesday, October 21, 2009 11:49 AM
> To: java-user@lucene.apache.org
> Subject: Re: Using org.apache.lucene.analysis.compound
>
> yes, your dictionary :)
>
> if überwachungsgesetz is a real word, add it to your dictionary.
>
> for example, if your dictionary is { "Rind", "Fleisch", "Draht", "Schere",
> "Gesetz", "Aufgabe", "Überwachung" }, and you index
> Rindfleischüberwachungsgesetz, then all 3 queries will have the same score.
> but if you expand the dictionary to { "Rind", "Fleisch", "Draht", "Schere",
> "Gesetz", "Aufgabe", "Überwachung", "Überwachungsgesetz" }, then this makes
> a big difference.
>
> all 3 queries will still match, but überwachungsgesetz will have a higher
> score. this is because now things are analyzed differently:
> Rindfleischüberwachungsgesetz will be decompounded as before, but with an
> additional token: Überwachungsgesetz.
> so back to your original question, these 'concatenations' of multiple
> components, yes compounds will do that, if they are real words. but it
> won't just make them up.
>
> "überwachungsgesetz"
> 0.23013961 = (MATCH) sum of:
>  0.057534903 = (MATCH) weight(field:überwachungsgesetz in 0), product of:
>    0.5 = queryWeight(field:überwachungsgesetz), product of:
>      0.30685282 = idf(docFreq=1, maxDocs=1)
>      1.6294457 = queryNorm
>    0.11506981 = (MATCH) fieldWeight(field:überwachungsgesetz in 0), product of:
>      1.0 = tf(termFreq(field:überwachungsgesetz)=1)
>      0.30685282 = idf(docFreq=1, maxDocs=1)
>      0.375 = fieldNorm(field=field, doc=0)
>  0.057534903 = (MATCH) weight(field:überwachung in 0), product of:
>    0.5 = queryWeight(field:überwachung), product of:
>      0.30685282 = idf(docFreq=1, maxDocs=1)
>      1.6294457 = queryNorm
>    0.11506981 = (MATCH) fieldWeight(field:überwachung in 0), product of:
>      1.0 = tf(termFreq(field:überwachung)=1)
>      0.30685282 = idf(docFreq=1, maxDocs=1)
>      0.375 = fieldNorm(field=field, doc=0)
>  0.057534903 = (MATCH) weight(field:überwachungsgesetz in 0), product of:
>    0.5 = queryWeight(field:überwachungsgesetz), product of:
>      0.30685282 = idf(docFreq=1, maxDocs=1)
>      1.6294457 = queryNorm
>    0.11506981 = (MATCH) fieldWeight(field:überwachungsgesetz in 0), product of:
>      1.0 = tf(termFreq(field:überwachungsgesetz)=1)
>      0.30685282 = idf(docFreq=1, maxDocs=1)
>      0.375 = fieldNorm(field=field, doc=0)
>  0.057534903 = (MATCH) weight(field:gesetz in 0), product of:
>    0.5 = queryWeight(field:gesetz), product of:
>      0.30685282 = idf(docFreq=1, maxDocs=1)
>      1.6294457 = queryNorm
>    0.11506981 = (MATCH) fieldWeight(field:gesetz in 0), product of:
>      1.0 = tf(termFreq(field:gesetz)=1)
>      0.30685282 = idf(docFreq=1, maxDocs=1)
>      0.375 = fieldNorm(field=field, doc=0)
>
> "gesetzüberwachung"
> 0.064782135 = (MATCH) sum of:
>  0.032391068 = (MATCH) weight(field:gesetz in 0), product of:
>    0.2814906 = queryWeight(field:gesetz), product of:
>      0.30685282 = idf(docFreq=1, maxDocs=1)
>      0.9173473 = queryNorm
>    0.11506981 = (MATCH) fieldWeight(field:gesetz in 0), product of:
>      1.0 = tf(termFreq(field:gesetz)=1)
>      0.30685282 = idf(docFreq=1, maxDocs=1)
>      0.375 = fieldNorm(field=field, doc=0)
>  0.032391068 = (MATCH) weight(field:überwachung in 0), product of:
>    0.2814906 = queryWeight(field:überwachung), product of:
>      0.30685282 = idf(docFreq=1, maxDocs=1)
>      0.9173473 = queryNorm
>    0.11506981 = (MATCH) fieldWeight(field:überwachung in 0), product of:
>      1.0 = tf(termFreq(field:überwachung)=1)
>      0.30685282 = idf(docFreq=1, maxDocs=1)
>      0.375 = fieldNorm(field=field, doc=0)
>
> "fleischgesetz"
> 0.064782135 = (MATCH) sum of:
>  0.032391068 = (MATCH) weight(field:fleisch in 0), product of:
>    0.2814906 = queryWeight(field:fleisch), product of:
>      0.30685282 = idf(docFreq=1, maxDocs=1)
>      0.9173473 = queryNorm
>    0.11506981 = (MATCH) fieldWeight(field:fleisch in 0), product of:
>      1.0 = tf(termFreq(field:fleisch)=1)
>      0.30685282 = idf(docFreq=1, maxDocs=1)
>      0.375 = fieldNorm(field=field, doc=0)
>  0.032391068 = (MATCH) weight(field:gesetz in 0), product of:
>    0.2814906 = queryWeight(field:gesetz), product of:
>      0.30685282 = idf(docFreq=1, maxDocs=1)
>      0.9173473 = queryNorm
>    0.11506981 = (MATCH) fieldWeight(field:gesetz in 0), product of:
>      1.0 = tf(termFreq(field:gesetz)=1)
>      0.30685282 = idf(docFreq=1, maxDocs=1)
>      0.375 = fieldNorm(field=field, doc=0)
>
>
>
>
> On Wed, Oct 21, 2009 at 1:40 PM, Benjamin Douglas
> <bbdouglas@basistech.com> wrote:
>
> > Thanks for all of the answers so far!
> >
> > Paul's question is similar to another aspect I am curious about:
> >
> > Given the way the sample word is analyzed, is there anything in the
> > scoring mechanism that would rank "überwachungsgesetz" higher than
> > "gesetzüberwachung" or "fleischgesetz"?
> >
> >
>
> --
> Robert Muir
> rcmuir@gmail.com
>



-- 
Robert Muir
rcmuir@gmail.com