Mailing-List: contact java-user-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-user@lucene.apache.org
MIME-Version: 1.0
In-Reply-To: 
 <CAE8639A6FAB3D4AA7973D5897B325185C7ED8C6E0@MSG-BOX.basistech.net>
References: <CAE8639A6FAB3D4AA7973D5897B325185C7ED8C67B@MSG-BOX.basistech.net>
	<8f0ad1f30910201900t294f64c2v6d10b2ef504666b8@mail.gmail.com>
	<B607F3A6-5E02-4C6B-A8B2-BC4124802AE3@activemath.org>
 <8f0ad1f30910210512m5b0e3359yd161a569de67b191@mail.gmail.com>
	<CAE8639A6FAB3D4AA7973D5897B325185C7ED8C6C1@MSG-BOX.basistech.net>
	<8f0ad1f30910211148v3d4ab1b6u6b7dadcecc95ab86@mail.gmail.com>
	<CAE8639A6FAB3D4AA7973D5897B325185C7ED8C6E0@MSG-BOX.basistech.net>
From: Robert Muir <rcmuir@gmail.com>
Date: Wed, 21 Oct 2009 15:17:33 -0400
Message-ID: <8f0ad1f30910211217j3c1c1cd1k9414bec1d0d865ae@mail.gmail.com>
Subject: Re: Using org.apache.lucene.analysis.compound
To: java-user@lucene.apache.org
Content-Type: multipart/alternative; boundary=001636b14888dddc3a047676d50a

--001636b14888dddc3a047676d50a
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

just add them to the dictionary; the compound filter will then emit them
automatically.

if you want to tweak it even further, you can also tell the compound filter
NOT to emit the subwords if they form a bigger compound, using the
onlyLongestMatch parameter i spoke of earlier.
I haven't played with this option much, but I think this is what it's
supposed to do:

if the dictionary is
soft
ball
softball

then "softball" (or compounds containing it) won't emit "soft" and "ball",
because "softball" is in the dictionary and it's the longest match.
with the option off, you'd get softball, ball, soft
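
to make the softball example concrete, here is a minimal sketch in plain Java. it is a toy model of the behavior as described above, not Lucene's actual filter: the class name LongestMatchSketch and the subwords helper are made up for illustration, and the real filter may differ in details (minimum subword sizes, token order, whether the original token is also emitted).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical toy model, NOT the Lucene implementation.
class LongestMatchSketch {
    // Collects every dictionary word found inside `token`. When
    // onlyLongestMatch is true, a match is dropped if it lies entirely
    // inside a strictly longer dictionary match.
    static List<String> subwords(String token, Set<String> dict, boolean onlyLongestMatch) {
        List<int[]> spans = new ArrayList<>(); // {start, end} of each dictionary hit
        for (int start = 0; start < token.length(); start++) {
            for (int end = start + 1; end <= token.length(); end++) {
                if (dict.contains(token.substring(start, end))) {
                    spans.add(new int[] {start, end});
                }
            }
        }
        List<String> out = new ArrayList<>();
        for (int[] s : spans) {
            boolean covered = false;
            if (onlyLongestMatch) {
                for (int[] o : spans) {
                    boolean longer = (o[1] - o[0]) > (s[1] - s[0]);
                    if (longer && o[0] <= s[0] && s[1] <= o[1]) {
                        covered = true;
                        break;
                    }
                }
            }
            if (!covered) {
                out.add(token.substring(s[0], s[1]));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> dict = new LinkedHashSet<>(Arrays.asList("soft", "ball", "softball"));
        System.out.println(subwords("softball", dict, true));  // [softball]
        System.out.println(subwords("softball", dict, false)); // [soft, softball, ball]
    }
}
```

in this model the option only changes which dictionary hits survive; the dictionary itself is the same in both calls.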

On Wed, Oct 21, 2009 at 3:09 PM, Benjamin Douglas
<bbdouglas@basistech.com> wrote:

> OK, that makes sense. So I just need to add all of the sub-compounds that
> are real words at posIncr=0, even if they are combinations of other
> sub-compounds.
>
> Thanks!
>
> -----Original Message-----
> From: Robert Muir [mailto:rcmuir@gmail.com]
> Sent: Wednesday, October 21, 2009 11:49 AM
> To: java-user@lucene.apache.org
> Subject: Re: Using org.apache.lucene.analysis.compound
>
> yes, your dictionary :)
>
> if überwachungsgesetz is a real word, add it to your dictionary.
>
> for example, if your dictionary is { "Rind", "Fleisch", "Draht", "Schere",
> "Gesetz", "Aufgabe", "Überwachung" }, and you index
> Rindfleischüberwachungsgesetz, then all 3 queries will have the same score.
> but if you expand the dictionary to { "Rind", "Fleisch", "Draht", "Schere",
> "Gesetz", "Aufgabe", "Überwachung", "Überwachungsgesetz" }, then this makes
> a big difference.
>
> all 3 queries will still match, but überwachungsgesetz will have a higher
> score. this is because now things are analyzed differently:
> Rindfleischüberwachungsgesetz will be decompounded as before, but with an
> additional token: Überwachungsgesetz.
> so back to your original question, these 'concatenations' of multiple
> components, yes compounds will do that, if they are real words. but it won't
> just make them up.
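
a rough sketch of the point about the two dictionaries, again in plain Java rather than against the Lucene API. the class DictionaryExpansionSketch and its subwords helper are hypothetical, matching is done on lowercased text (an assumption of this sketch), and the non-ASCII letters are written as \u escapes only so the source compiles regardless of file encoding.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical toy model, NOT the Lucene implementation.
class DictionaryExpansionSketch {
    // Returns every dictionary word that occurs inside the token, ordered by
    // start offset. Nothing is emitted unless it is literally in the dictionary.
    static List<String> subwords(String token, List<String> dictionary) {
        Set<String> dict = new HashSet<>();
        for (String w : dictionary) {
            dict.add(w.toLowerCase(Locale.ROOT)); // case-insensitive matching (assumption)
        }
        String t = token.toLowerCase(Locale.ROOT);
        List<String> out = new ArrayList<>();
        for (int start = 0; start < t.length(); start++) {
            for (int end = start + 1; end <= t.length(); end++) {
                String candidate = t.substring(start, end);
                if (dict.contains(candidate)) {
                    out.add(candidate);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // \u00dc is U-umlaut (uppercase), \u00fc is u-umlaut (lowercase)
        List<String> small = Arrays.asList("Rind", "Fleisch", "Draht", "Schere",
                "Gesetz", "Aufgabe", "\u00dcberwachung");
        List<String> expanded = new ArrayList<>(small);
        expanded.add("\u00dcberwachungsgesetz");

        String word = "Rindfleisch\u00fcberwachungsgesetz";
        System.out.println(subwords(word, small));    // 4 subwords
        System.out.println(subwords(word, expanded)); // same 4 plus the longer compound
    }
}
```

with the small dictionary the sketch yields rind, fleisch, überwachung, gesetz; with the expanded one it yields the same four plus überwachungsgesetz, which is the extra token that gives that query more to match on.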
>
> "überwachungsgesetz"
> 0.23013961 = (MATCH) sum of:
>  0.057534903 = (MATCH) weight(field:überwachungsgesetz in 0), product of:
>    0.5 = queryWeight(field:überwachungsgesetz), product of:
>      0.30685282 = idf(docFreq=1, maxDocs=1)
>      1.6294457 = queryNorm
>    0.11506981 = (MATCH) fieldWeight(field:überwachungsgesetz in 0), product of:
>      1.0 = tf(termFreq(field:überwachungsgesetz)=1)
>      0.30685282 = idf(docFreq=1, maxDocs=1)
>      0.375 = fieldNorm(field=field, doc=0)
>  0.057534903 = (MATCH) weight(field:überwachung in 0), product of:
>    0.5 = queryWeight(field:überwachung), product of:
>      0.30685282 = idf(docFreq=1, maxDocs=1)
>      1.6294457 = queryNorm
>    0.11506981 = (MATCH) fieldWeight(field:überwachung in 0), product of:
>      1.0 = tf(termFreq(field:überwachung)=1)
>      0.30685282 = idf(docFreq=1, maxDocs=1)
>      0.375 = fieldNorm(field=field, doc=0)
>  0.057534903 = (MATCH) weight(field:überwachungsgesetz in 0), product of:
>    0.5 = queryWeight(field:überwachungsgesetz), product of:
>      0.30685282 = idf(docFreq=1, maxDocs=1)
>      1.6294457 = queryNorm
>    0.11506981 = (MATCH) fieldWeight(field:überwachungsgesetz in 0), product of:
>      1.0 = tf(termFreq(field:überwachungsgesetz)=1)
>      0.30685282 = idf(docFreq=1, maxDocs=1)
>      0.375 = fieldNorm(field=field, doc=0)
>  0.057534903 = (MATCH) weight(field:gesetz in 0), product of:
>    0.5 = queryWeight(field:gesetz), product of:
>      0.30685282 = idf(docFreq=1, maxDocs=1)
>      1.6294457 = queryNorm
>    0.11506981 = (MATCH) fieldWeight(field:gesetz in 0), product of:
>      1.0 = tf(termFreq(field:gesetz)=1)
>      0.30685282 = idf(docFreq=1, maxDocs=1)
>      0.375 = fieldNorm(field=field, doc=0)
>
> "gesetzüberwachung"
> 0.064782135 = (MATCH) sum of:
>  0.032391068 = (MATCH) weight(field:gesetz in 0), product of:
>    0.2814906 = queryWeight(field:gesetz), product of:
>      0.30685282 = idf(docFreq=1, maxDocs=1)
>      0.9173473 = queryNorm
>    0.11506981 = (MATCH) fieldWeight(field:gesetz in 0), product of:
>      1.0 = tf(termFreq(field:gesetz)=1)
>      0.30685282 = idf(docFreq=1, maxDocs=1)
>      0.375 = fieldNorm(field=field, doc=0)
>  0.032391068 = (MATCH) weight(field:überwachung in 0), product of:
>    0.2814906 = queryWeight(field:überwachung), product of:
>      0.30685282 = idf(docFreq=1, maxDocs=1)
>      0.9173473 = queryNorm
>    0.11506981 = (MATCH) fieldWeight(field:überwachung in 0), product of:
>      1.0 = tf(termFreq(field:überwachung)=1)
>      0.30685282 = idf(docFreq=1, maxDocs=1)
>      0.375 = fieldNorm(field=field, doc=0)
>
> "fleischgesetz"
> 0.064782135 = (MATCH) sum of:
>  0.032391068 = (MATCH) weight(field:fleisch in 0), product of:
>    0.2814906 = queryWeight(field:fleisch), product of:
>      0.30685282 = idf(docFreq=1, maxDocs=1)
>      0.9173473 = queryNorm
>    0.11506981 = (MATCH) fieldWeight(field:fleisch in 0), product of:
>      1.0 = tf(termFreq(field:fleisch)=1)
>      0.30685282 = idf(docFreq=1, maxDocs=1)
>      0.375 = fieldNorm(field=field, doc=0)
>  0.032391068 = (MATCH) weight(field:gesetz in 0), product of:
>    0.2814906 = queryWeight(field:gesetz), product of:
>      0.30685282 = idf(docFreq=1, maxDocs=1)
>      0.9173473 = queryNorm
>    0.11506981 = (MATCH) fieldWeight(field:gesetz in 0), product of:
>      1.0 = tf(termFreq(field:gesetz)=1)
>      0.30685282 = idf(docFreq=1, maxDocs=1)
>      0.375 = fieldNorm(field=field, doc=0)
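
the arithmetic in the explain output above can be re-checked with a few lines of plain Java. this is only a sketch that multiplies out the numbers as they are printed (queryWeight = idf * queryNorm, fieldWeight = tf * idf * fieldNorm, term weight = queryWeight * fieldWeight); the class and method names are made up, and queryNorm is taken as given from the output rather than derived.

```java
// Hypothetical helper that re-does the products shown in the explain output.
class ExplainArithmeticSketch {
    // Contribution of a single matching term to the document score.
    static double termWeight(double idf, double queryNorm, double tf, double fieldNorm) {
        double queryWeight = idf * queryNorm;      // e.g. 0.30685282 * 1.6294457
        double fieldWeight = tf * idf * fieldNorm; // e.g. 1.0 * 0.30685282 * 0.375
        return queryWeight * fieldWeight;
    }

    public static void main(String[] args) {
        double idf = 0.30685282;
        double tf = 1.0;
        double fieldNorm = 0.375;

        // four matching terms, queryNorm as printed for the first query
        double fourTerms = 4 * termWeight(idf, 1.6294457, tf, fieldNorm);
        // two matching terms, queryNorm as printed for the other two queries
        double twoTerms = 2 * termWeight(idf, 0.9173473, tf, fieldNorm);

        System.out.printf("%.6f%n", fourTerms);
        System.out.printf("%.6f%n", twoTerms);
    }
}
```

the first query sums four term weights and comes out around 0.2301, while the other two sum only two and come out around 0.0648, so the gap comes from the number of matching terms and the different queryNorm, not from any per-term difference in idf, tf or fieldNorm.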
>
>
>
>
> On Wed, Oct 21, 2009 at 1:40 PM, Benjamin Douglas
> <bbdouglas@basistech.com> wrote:
>
> > Thanks for all of the answers so far!
> >
> > Paul's question is similar to another aspect I am curious about:
> >
> > Given the way the sample word is analyzed, is there anything in the
> > scoring mechanism that would rank "überwachungsgesetz" higher than
> > "gesetzüberwachung" or "fleischgesetz"?
> >
> >
>
> --
> Robert Muir
> rcmuir@gmail.com
>


-- 
Robert Muir
rcmuir@gmail.com

--001636b14888dddc3a047676d50a--