Mailing-List: contact solr-user-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: solr-user@lucene.apache.org
Received-SPF: pass (athena.apache.org: local policy)
Subject: Re: Different search results for (german) singular/plural searches
	-	looking for a solution
From: Martin Grotzke <martin.grotzke@javakaffee.de>
To: solr-user@lucene.apache.org
In-Reply-To: <470E4218.4070403@kabuco.de>
References: <1192010412.3422.51.camel@localhost.localdomain.tld>
	 <470CA96A.4020504@kabuco.de>
	 <1192111522.3404.126.camel@localhost.localdomain.tld>
	 <470E4218.4070403@kabuco.de>
Content-Type: multipart/signed; micalg=pgp-sha1;
 protocol="application/pgp-signature"; boundary="=-x2+37HCDvh1+dOuHO0Iq"
Date: Tue, 16 Oct 2007 18:52:53 +0200
Message-Id: <1192553573.8991.59.camel@localhost.localdomain.tld>
Mime-Version: 1.0

--=-x2+37HCDvh1+dOuHO0Iq
Content-Type: text/plain; charset=ISO-8859-15
Content-Transfer-Encoding: quoted-printable

Hi,

I have now played around with the Snowball Porter stemmer and it definitely
feels really good (I used German2 as suggested).

For some cases (e.g. product types like top/tops, bermuda/bermudas or
hoody/hoodies) we additionally need synonyms. At first I thought it
would be good to use synonyms only at query time, but the docs in the
wiki recommend expanding synonyms at index time...

What are your experiences? Would you also suggest using them when
indexing?
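
To make the two options concrete, here is a minimal sketch in plain Python
(not Solr code) of expanding synonyms at index time versus at query time.
The synonym groups, the sample documents and the helper names expand,
build_index and search are assumptions made up for illustration only.

```python
from collections import defaultdict

# Hypothetical synonym groups, based on the product types mentioned above.
SYNONYMS = [{"top", "tops"}, {"bermuda", "bermudas"}, {"hoody", "hoodies"}]

def expand(token):
    """Return the token plus every synonym from its group (or just the token)."""
    for group in SYNONYMS:
        if token in group:
            return set(group)
    return {token}

def build_index(docs, expand_at_index_time):
    """Build a tiny inverted index: term -> set of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            terms = expand(token) if expand_at_index_time else {token}
            for term in terms:
                index[term].add(doc_id)
    return index

def search(index, query, expand_at_query_time):
    """Return ids of documents matching every query token (AND semantics)."""
    result = None
    for token in query.lower().split():
        terms = expand(token) if expand_at_query_time else {token}
        hits = set()
        for term in terms:
            hits |= index.get(term, set())
        result = hits if result is None else result & hits
    return result or set()

# Made-up sample documents.
DOCS = {1: "blue top", 2: "red tops", 3: "grey hoodies"}

index_time = build_index(DOCS, expand_at_index_time=True)   # larger index, plain queries
query_time = build_index(DOCS, expand_at_index_time=False)  # plain index, expanded queries

print(search(index_time, "top", expand_at_query_time=False))
print(search(query_time, "top", expand_at_query_time=True))
```

For single-word synonyms like these, both variants find the same documents.
The difference is where the cost goes: index-time expansion makes the index
bigger and needs a reindex when the synonym list changes, while query-time
expansion keeps the index untouched but rewrites every query.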

On Thu, 2007-10-11 at 17:32 +0200, Thomas Traeger wrote:
> Martin Grotzke wrote:
> >> Try the SnowballPorterFilterFactory with German2 as language attribute
> >> first and use synonyms for combined words, e.g. "Herrenhose" => "Herren",
> >> "Hose".
> >>
> > so you use a combined approach?
> >
> Yes, we define the relevant parts of compound words (keywords only) as
> synonyms and feed them in a special field that is used for searching and
> for the product index.
So you don't use a single catchall field "text"? What is the reason for
this, what is the advantage?

> I hope there will be a filter that can split
> compound words sometime in the future...
Is there no standard approach for handling this problem apart from
synonyms?
Splitting compounds is exactly what jwordsplitter does (as posted by Daniel)...
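
As a rough idea of what dictionary-based compound splitting means, here is
a minimal sketch in plain Python. It is not jwordsplitter's API or
algorithm; the function split_compound, the min_part parameter and the tiny
vocabulary are assumptions for illustration, and German linking elements
(such as the "s" in "Arbeitshose") are not handled.

```python
def split_compound(word, vocabulary, min_part=3):
    """Split word into known parts, or return [word] if no full split exists."""
    word = word.lower()

    def helper(rest):
        if not rest:
            return []
        # Try the longest known prefix first, then shorter ones.
        for end in range(len(rest), min_part - 1, -1):
            prefix = rest[:end]
            if prefix in vocabulary:
                tail = helper(rest[end:])
                if tail is not None:
                    return [prefix] + tail
        return None  # no way to cover the rest with known words

    parts = helper(word)
    return parts if parts else [word]

# Hypothetical vocabulary of known product keywords.
VOCABULARY = {"herr", "herren", "hose", "damen", "jacke"}

print(split_compound("Herrenhose", VOCABULARY))   # ['herren', 'hose']
print(split_compound("Herrenhosen", VOCABULARY))  # ['herrenhosen']
```

The second call shows why splitting would have to run together with
stemming: the plural "hosen" is not in the vocabulary, so the word is left
unsplit.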


Thanx && cheers,
Martin


> >> By using stemming you will maybe have some "interesting" results, but it
> >> is much better living with them than having no or far fewer results ;o)
> >>
> > Do you have an example of what "interesting" results I can expect, just to
> > get an idea?
> >
> >> Find more info on the Snowball stemming algorithms here:
> >>
> >> http://snowball.tartarus.org/
> >>
> > Thanx! I also had a look at this site already, but what is missing is a
> > demo where one can see what's happening. I think I'll play a little with
> > stemming to get a feeling for this.
> >
> I think the Snowball stemmer is very good so I have no practical example
> for you. Maybe this is of value to see what happens:
>
> http://snowball.tartarus.org/algorithms/german/diffs.txt
>
> If you have mixed languages in your content, which sometimes happens in
> product data, you might get into some trouble.
>
> >> Also have a look at the StopFilterFactory, here is a sample stopword list
> >> for the German language:
> >>
> >> http://snowball.tartarus.org/algorithms/german/stop.txt
> >>
> > Our application handles products, do you think such stopwords are useful
> > in this scenario also? I wouldn't expect a user to search for "keine
> > hose" or something like this :)
> >
> I have seen much worse queries, so you never know ;o)
>
> think of a query like this: "Hose in blau für Herren"
>
> You will definitely want to remove "in" and "für" during searching and it
> reduces index size when removed during indexing. Maybe you will even get
> better scores when only relevant terms are used. You should optimize the
> stopword list based on your data.
>
> Regards,
>
> Tom
>
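
The stop word point in the quoted reply can be illustrated with a minimal
plain-Python sketch (not Solr's StopFilterFactory itself). The stop word
set and the function name remove_stopwords are assumptions for
illustration; a real list would be loaded from a file such as the stop.txt
linked above and tuned to the product data.

```python
# Hypothetical, very small stop word set.
GERMAN_STOPWORDS = {"in", "für", "der", "die", "das", "und"}

def remove_stopwords(text, stopwords=GERMAN_STOPWORDS):
    """Lowercase the text, split on whitespace and drop stop words."""
    return [token for token in text.lower().split() if token not in stopwords]

print(remove_stopwords("Hose in blau für Herren"))  # ['hose', 'blau', 'herren']
```

Applying the same filter to documents and queries keeps both sides
consistent, which is why the filter would sit in the analyzer for indexing
as well as for searching.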
-- 
Martin Grotzke
http://www.javakaffee.de/blog/

--=-x2+37HCDvh1+dOuHO0Iq--