Subject: Re: Different search results for (german) singular/plural searches - looking for a solution
From: Martin Grotzke
To: solr-user@lucene.apache.org
Date: Tue, 16 Oct 2007 18:52:53 +0200

Hi,

I have now played around with the Snowball Porter stemmer and it definitely
feels really good (I used German2 as suggested). For some cases (e.g. product
types like top/tops, bermuda/bermudas or hoody/hoodies) we additionally need
synonyms.

At first I thought it would be good to use synonyms only at query time, but
the docs in the wiki recommend expanding synonyms at index time... What are
your experiences? Would you also suggest applying them at index time?

On Thu, 2007-10-11 at 17:32 +0200, Thomas Traeger wrote:
> Martin Grotzke wrote:
> >> Try the SnowballPorterFilterFactory with German2 as language attribute
> >> first and use synonyms for combined words, e.g. "Herrenhose" => "Herren",
> >> "Hose".
> >>
> > so you use a combined approach?
> >
> Yes, we define the relevant parts of compound words (keywords only) as
> synonyms and feed them into a special field that is used for searching and
> for the product index.

So you don't use a single catchall field "text"? What is the reason for
this, and what is the advantage?

> I hope there will be a filter that can split
> compound words sometime in the future...

Is there no standard approach for handling this problem apart from synonyms?
Splitting compounds is exactly what jwordsplitter does (as posted by Daniel)...

Thanx && cheers,
Martin

> >> By using stemming you will maybe have some "interesting" results, but it
> >> is much better living with them than having no or far fewer results ;o)
> >>
> > Do you have an example of what "interesting" results I can expect, just to
> > get an idea?
> >
> >> Find more info on the Snowball stemming algorithms here:
> >>
> >> http://snowball.tartarus.org/
> >>
> > Thanx! I also had a look at this site already, but what is missing is a
> > demo where one can see what's happening. I think I'll play a little with
> > stemming to get a feeling for this.
> >
> I think the Snowball stemmer is very good so I have no practical example
> for you. Maybe this is of value to see what happens:
>
> http://snowball.tartarus.org/algorithms/german/diffs.txt
>
> If you have mixed languages in your content, which sometimes happens in
> product data, you might get into some trouble.
>
> >> Also have a look at the StopFilterFactory, here is a sample stopword list
> >> for the German language:
> >>
> >> http://snowball.tartarus.org/algorithms/german/stop.txt
> >>
> > Our application handles products, do you think such stopwords are useful
> > in this scenario also? I wouldn't expect a user to search for "keine
> > hose" or something like this :)
> >
> I have seen much worse queries, so you never know ;o)
>
> Think of a query like this: "Hose in blau für Herren"
>
> You will definitely want to remove "in" and "für" during searching, and it
> reduces index size when they are removed during indexing. Maybe you will
> even get better scores when only relevant terms are used. You should
> optimize the stopword list based on your data.
>
> Regards,
>
> Tom
>

-- 
Martin Grotzke
http://www.javakaffee.de/blog/
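
To make the stopword and compound-synonym points from this thread concrete, here is a minimal sketch in plain Python. It is not Solr code and not anyone's actual setup: the function name `analyze`, the two-entry stopword set and the single synonym mapping are assumptions taken from the examples in the thread ("in", "für", "Herrenhose"), and stemming is left out entirely.

```python
# Hypothetical illustration of the analysis steps discussed in the thread.
# Not Solr's implementation; all names and data here are assumptions.

# Tiny assumed subset of a German stopword list (only the two words
# named in the thread).
STOPWORDS = {"in", "für"}

# Assumed compound-to-parts mapping, taken from the thread's example.
COMPOUND_SYNONYMS = {"herrenhose": ["herren", "hose"]}


def analyze(text):
    """Lowercase, split on whitespace, drop stopwords, expand compounds."""
    terms = []
    for token in text.lower().split():
        if token in STOPWORDS:
            continue  # stopwords never reach the index or the query
        terms.append(token)
        # Add the parts of a known compound next to the compound itself.
        terms.extend(COMPOUND_SYNONYMS.get(token, []))
    return terms


print(analyze("Hose in blau für Herren"))  # ['hose', 'blau', 'herren']
print(analyze("Herrenhose"))               # ['herrenhose', 'herren', 'hose']
```

With the expansion applied to the indexed text, a document containing "Herrenhose" also carries the terms "herren" and "hose", so the stopword-filtered query above can match it without any query-time synonym handling.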