From: Robert Muir
Date: Wed, 21 Oct 2009 15:17:33 -0400
Message-ID: <8f0ad1f30910211217j3c1c1cd1k9414bec1d0d865ae@mail.gmail.com>
Subject: Re: Using org.apache.lucene.analysis.compound
To: java-user@lucene.apache.org

Just add them to the dictionary; the compound filter will do this
automatically.

If you want to tweak it even further, you can also tell the compound filter
NOT to emit the subwords when they form a bigger compound, using the
onlyLongestMatch parameter I spoke of earlier. I haven't played with this
option much, but I think this is what it's supposed to do:

if the dictionary is

soft
ball
softball

then "softball" (or compounds containing it) won't emit "soft" and "ball",
because "softball" is in the dictionary and it's the longest match.

With the option off, you'd get softball, ball, soft.

On Wed, Oct 21, 2009 at 3:09 PM, Benjamin Douglas wrote:

> OK, that makes sense. So I just need to add all of the sub-compounds that
> are real words at posIncr=0, even if they are combinations of other
> sub-compounds.
>
> Thanks!
>
> -----Original Message-----
> From: Robert Muir [mailto:rcmuir@gmail.com]
> Sent: Wednesday, October 21, 2009 11:49 AM
> To: java-user@lucene.apache.org
> Subject: Re: Using org.apache.lucene.analysis.compound
>
> yes, your dictionary :)
>
> if überwachungsgesetz is a real word, add it to your dictionary.
>
> for example, if your dictionary is { "Rind", "Fleisch", "Draht", "Schere",
> "Gesetz", "Aufgabe", "Überwachung" }, and you index
> Rindfleischüberwachungsgesetz, then all 3 queries will have the same score.
> but if you expand the dictionary to { "Rind", "Fleisch", "Draht", "Schere",
> "Gesetz", "Aufgabe", "Überwachung", "Überwachungsgesetz" }, then this makes
> a big difference.
>
> all 3 queries will still match, but überwachungsgesetz will have a higher
> score. this is because now things are analyzed differently:
> Rindfleischüberwachungsgesetz will be decompounded as before, but with an
> additional token: Überwachungsgesetz.
> so back to your original question, these 'concatenations' of multiple
> components, yes compounds will do that, if they are real words. but it
> won't just make them up.
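A minimal sketch of the dictionary effect described in the quoted message, in plain Java with no Lucene dependency. It is a toy model, not the compound filter's implementation: the names DecompoundSketch and decompound are made up for illustration, the filter's minimum and maximum word and subword size settings are ignored, and tokens are assumed to be lowercased first. It emits the original token followed by every dictionary entry found inside it. Non-ASCII letters are written as Unicode escapes so the source stays ASCII.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy model only: NOT Lucene's DictionaryCompoundWordTokenFilter.
final class DecompoundSketch {

    // Returns the lowercased token followed by every dictionary entry
    // that occurs inside it, ordered by start offset, then by length.
    static List<String> decompound(String token, Set<String> dictionary) {
        String lower = token.toLowerCase(Locale.ROOT);
        Set<String> dict = new HashSet<>();
        for (String entry : dictionary) {
            dict.add(entry.toLowerCase(Locale.ROOT));
        }
        List<String> out = new ArrayList<>();
        out.add(lower); // the original token is always kept
        for (int start = 0; start < lower.length(); start++) {
            for (int end = start + 1; end <= lower.length(); end++) {
                String candidate = lower.substring(start, end);
                if (dict.contains(candidate)) {
                    out.add(candidate);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // \u00dc is capital U-umlaut, \u00fc is lowercase u-umlaut.
        Set<String> small = new HashSet<>(Arrays.asList(
                "Rind", "Fleisch", "Draht", "Schere",
                "Gesetz", "Aufgabe", "\u00dcberwachung"));
        Set<String> expanded = new HashSet<>(small);
        expanded.add("\u00dcberwachungsgesetz");

        String word = "Rindfleisch\u00fcberwachungsgesetz";
        System.out.println(decompound(word, small));
        System.out.println(decompound(word, expanded));
    }
}
```

With the expanded dictionary the indexed word gains one extra token (überwachungsgesetz). Running the same function on the query "überwachungsgesetz" yields four tokens (the query itself, überwachung, überwachungsgesetz, gesetz), which lines up with the four clauses in the first explain output quoted in this message.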
>
> "überwachungsgesetz"
> 0.23013961 = (MATCH) sum of:
>   0.057534903 = (MATCH) weight(field:überwachungsgesetz in 0), product of:
>     0.5 = queryWeight(field:überwachungsgesetz), product of:
>       0.30685282 = idf(docFreq=1, maxDocs=1)
>       1.6294457 = queryNorm
>     0.11506981 = (MATCH) fieldWeight(field:überwachungsgesetz in 0), product of:
>       1.0 = tf(termFreq(field:überwachungsgesetz)=1)
>       0.30685282 = idf(docFreq=1, maxDocs=1)
>       0.375 = fieldNorm(field=field, doc=0)
>   0.057534903 = (MATCH) weight(field:überwachung in 0), product of:
>     0.5 = queryWeight(field:überwachung), product of:
>       0.30685282 = idf(docFreq=1, maxDocs=1)
>       1.6294457 = queryNorm
>     0.11506981 = (MATCH) fieldWeight(field:überwachung in 0), product of:
>       1.0 = tf(termFreq(field:überwachung)=1)
>       0.30685282 = idf(docFreq=1, maxDocs=1)
>       0.375 = fieldNorm(field=field, doc=0)
>   0.057534903 = (MATCH) weight(field:überwachungsgesetz in 0), product of:
>     0.5 = queryWeight(field:überwachungsgesetz), product of:
>       0.30685282 = idf(docFreq=1, maxDocs=1)
>       1.6294457 = queryNorm
>     0.11506981 = (MATCH) fieldWeight(field:überwachungsgesetz in 0), product of:
>       1.0 = tf(termFreq(field:überwachungsgesetz)=1)
>       0.30685282 = idf(docFreq=1, maxDocs=1)
>       0.375 = fieldNorm(field=field, doc=0)
>   0.057534903 = (MATCH) weight(field:gesetz in 0), product of:
>     0.5 = queryWeight(field:gesetz), product of:
>       0.30685282 = idf(docFreq=1, maxDocs=1)
>       1.6294457 = queryNorm
>     0.11506981 = (MATCH) fieldWeight(field:gesetz in 0), product of:
>       1.0 = tf(termFreq(field:gesetz)=1)
>       0.30685282 = idf(docFreq=1, maxDocs=1)
>       0.375 = fieldNorm(field=field, doc=0)
>
> "gesetzüberwachung"
> 0.064782135 = (MATCH) sum of:
>   0.032391068 = (MATCH) weight(field:gesetz in 0), product of:
>     0.2814906 = queryWeight(field:gesetz), product of:
>       0.30685282 = idf(docFreq=1, maxDocs=1)
>       0.9173473 = queryNorm
>     0.11506981 = (MATCH) fieldWeight(field:gesetz in 0), product of:
>       1.0 = tf(termFreq(field:gesetz)=1)
>       0.30685282 = idf(docFreq=1, maxDocs=1)
>       0.375 = fieldNorm(field=field, doc=0)
>   0.032391068 = (MATCH) weight(field:überwachung in 0), product of:
>     0.2814906 = queryWeight(field:überwachung), product of:
>       0.30685282 = idf(docFreq=1, maxDocs=1)
>       0.9173473 = queryNorm
>     0.11506981 = (MATCH) fieldWeight(field:überwachung in 0), product of:
>       1.0 = tf(termFreq(field:überwachung)=1)
>       0.30685282 = idf(docFreq=1, maxDocs=1)
>       0.375 = fieldNorm(field=field, doc=0)
>
> "fleischgesetz"
> 0.064782135 = (MATCH) sum of:
>   0.032391068 = (MATCH) weight(field:fleisch in 0), product of:
>     0.2814906 = queryWeight(field:fleisch), product of:
>       0.30685282 = idf(docFreq=1, maxDocs=1)
>       0.9173473 = queryNorm
>     0.11506981 = (MATCH) fieldWeight(field:fleisch in 0), product of:
>       1.0 = tf(termFreq(field:fleisch)=1)
>       0.30685282 = idf(docFreq=1, maxDocs=1)
>       0.375 = fieldNorm(field=field, doc=0)
>   0.032391068 = (MATCH) weight(field:gesetz in 0), product of:
>     0.2814906 = queryWeight(field:gesetz), product of:
>       0.30685282 = idf(docFreq=1, maxDocs=1)
>       0.9173473 = queryNorm
>     0.11506981 = (MATCH) fieldWeight(field:gesetz in 0), product of:
>       1.0 = tf(termFreq(field:gesetz)=1)
>       0.30685282 = idf(docFreq=1, maxDocs=1)
>       0.375 = fieldNorm(field=field, doc=0)
>
> On Wed, Oct 21, 2009 at 1:40 PM, Benjamin Douglas wrote:
>
> > Thanks for all of the answers so far!
> >
> > Paul's question is similar to another aspect I am curious about:
> >
> > Given the way the sample word is analyzed, is there anything in the scoring
> > mechanism that would rank "überwachungsgesetz" higher than
> > "gesetzüberwachung" or "fleischgesetz"?
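The figures in the explain output quoted above can be re-derived by hand. The sketch here assumes the classic Lucene DefaultSimilarity formulas: idf = 1 + ln(maxDocs / (docFreq + 1)), queryNorm = 1 / sqrt(sum of squared idf values) with all boosts at 1.0, and tf = sqrt(termFreq). The class and method names are illustrative, and the fieldNorm of 0.375 is simply copied from the output. It also assumes, as an inference from the numbers rather than something stated in the thread, that the un-decompounded query token (for example "gesetzüberwachung"), which is not in the index, still contributes an idf of 1.0 to the query norm.

```java
// Re-derives the explain() numbers by hand; illustrative, not Lucene code.
final class ExplainArithmetic {

    // Classic idf: 1 + ln(maxDocs / (docFreq + 1)).
    static double idf(int docFreq, int maxDocs) {
        return 1.0 + Math.log((double) maxDocs / (docFreq + 1));
    }

    // queryNorm = 1 / sqrt(sum of squared term weights); boosts assumed 1.0,
    // so each term weight is just its idf.
    static double queryNorm(double... idfs) {
        double sumOfSquares = 0.0;
        for (double v : idfs) {
            sumOfSquares += v * v;
        }
        return 1.0 / Math.sqrt(sumOfSquares);
    }

    // Score of one matching clause: queryWeight * fieldWeight.
    static double clauseScore(double idf, double queryNorm,
                              int termFreq, double fieldNorm) {
        double queryWeight = idf * queryNorm;
        double fieldWeight = Math.sqrt(termFreq) * idf * fieldNorm;
        return queryWeight * fieldWeight;
    }

    public static void main(String[] args) {
        double hit = idf(1, 1);  // term present in the single indexed document
        double miss = idf(0, 1); // term absent from the index (assumed case)

        // "ueberwachungsgesetz": four query tokens, all present in the index.
        double normFour = queryNorm(hit, hit, hit, hit);
        // "gesetzueberwachung" / "fleischgesetz": one absent token, two present.
        double normThree = queryNorm(miss, hit, hit);

        System.out.println(4 * clauseScore(hit, normFour, 1, 0.375));
        System.out.println(2 * clauseScore(hit, normThree, 1, 0.375));
    }
}
```

Under these assumptions the two printed totals come out close to 0.2301 and 0.0648, matching the output. The first query scores higher for two reasons visible in the numbers: four clauses match instead of two, and its queryNorm is larger because none of its analyzed tokens is missing from the index.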

--
Robert Muir
rcmuir@gmail.com