Subject: Re: How to tokenize/analyze docs for the spellchecker - at indexing and query time
From: Martin Grotzke
To: solr-user@lucene.apache.org
Date: Mon, 06 Oct 2008 22:45:38 +0200

On Mon, 2008-10-06 at 09:00 -0400, Grant Ingersoll wrote:
> On Oct 6, 2008, at 3:51 AM, Martin Grotzke wrote:
>
> > Hi Jason,
> >
> > what about multi-word searches like "harry potter"? When I do a search
> > in our index for "harry poter", I get the suggestion "harry
> > spotter" (using spellcheck.collate=true and jarowinkler distance).
> > Searching for "harry spotter" (we're searching AND, not OR) then gives
> > no results. I assume that this is because suggestions are done for
> > words separately, and this does not require that both/all suggestions
> > are contained in the same document.
>
> Yeah, the SpellCheckComponent is not phrase aware. My guess would be
> that you would somehow need a QueryConverter (see
> http://wiki.apache.org/solr/SpellCheckComponent)
> that preserved phrases as a single token. Likewise, you would need
> that on your indexing side as well for the spell checker. In short, I
> suppose it's possible, but it would be work. You probably could use
> the shingle filter (token based n-grams).

I also thought about something like this, and also stumbled over the
ShingleFilter :)
So I would change the "spell" field to use the ShingleFilter?

Did I understand the answer to the posting "chaining copyFields"
correctly, that I cannot pipe the title through some "shingledTitle"
field and copy it afterwards to the "spell" field (while other fields
like brand are copied directly to the spell field)?

Thanx && cheers,
Martin

> Alternatively, by using extendedResults, you can get back the
> frequency of each of the words, and then you could decide whether the
> collation is going to have any results assuming they are all or'd
> together. For phrases and AND queries, I'm not sure.
> It's doable, I'm sure, but it would be a lot more involved.
>
> > I wonder what's the standard approach for searches with multiple
> > words. Are these working ok for you?
> >
> > Cheers,
> > Martin
> >
> > On Fri, 2008-10-03 at 16:21 -0400, Jason Rennie wrote:
> >> Hi Martin,
> >>
> >> I'm a relative newbie to solr, have been playing with the spellcheck
> >> component and seem to have it working. I certainly can't explain what
> >> all is going on, but with any luck, I can help you get the spellchecker
> >> up-and-running. Additional replies in-lined below.
> >>
> >> On Wed, Oct 1, 2008 at 7:11 AM, Martin Grotzke wrote:
> >>
> >>> Now I'm thinking about the source-field in the spellchecker ("spell"):
> >>> how should fields be analyzed during indexing, and how should the
> >>> queryAnalyzerFieldType be configured.
> >>
> >> I followed the conventions in the default solrconfig.xml and schema.xml
> >> files. So I created a "textSpell" field type (schema.xml):
> >>
> >> positionIncrementGap="100">
> >>
> >> and used this for the queryAnalyzerFieldType. I also created a
> >> spellField to store the text I want to spell check against and used the
> >> same analyzer (figuring that the query and indexed data should be
> >> analyzed the same way) (schema.xml):
> >>
> >> stored="true" />
> >>
> >>> If I have brands like e.g. "Apple" or "Ed Hardy" I would copy them (the
> >>> field "brand") directly to the "spell" field. The "spell" field is of
> >>> type "string".
> >>
> >> We're copying description to spellField. I'd recommend using a type
> >> like the above textSpell type since "The StringField type is not
> >> analyzed, but indexed/stored verbatim" (schema.xml):
> >>
> >>> Other fields like e.g.
> >>> the product title I would first copy to some
> >>> whitespaceTokenized field (field type with WhitespaceTokenizerFactory)
> >>> and afterwards to the "spell" field. The product title might be e.g.
> >>> "Canon EOS 450D EF-S 18-55 mm".
> >>
> >> Hmm... I'm not sure if this would work as I don't think the analyzer is
> >> applied until after the copy is made. FWIW, I've had trouble copying
> >> multiple fields to spellField (i.e. adding a second copyField w/
> >> dest="spellField"), so we just index the spellchecker on a single
> >> field...
> >>
> >>> Shouldn't this be a WhitespaceTokenizerFactory, or is it better to use a
> >>> StandardTokenizerFactory here?
> >>
> >> I think if you use the same analyzer for indexing and queries, the
> >> distinction probably isn't tremendously important. When I went
> >> searching, it looked like the StandardTokenizer split on non-letters.
> >> I'd guess the rationale for using the StandardTokenizer is that it
> >> won't recommend non-letter characters. I was seeing some weirdness
> >> earlier (no inserts/deletes), but that disappeared now that I'm using
> >> the StandardTokenizer.
> >>
> >> Cheers,
> >>
> >> Jason
> > --
> > Martin Grotzke
> > http://www.javakaffee.de/blog/
>
> --------------------------
> Grant Ingersoll
>
> Lucene Helpful Hints:
> http://wiki.apache.org/lucene-java/BasicsOfPerformance
> http://wiki.apache.org/lucene-java/LuceneFAQ
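
The two points in this thread (per-word suggestions failing under AND, and shingles as token-based n-grams) can be illustrated with a small sketch. This is plain Python, not Solr configuration and not Lucene's ShingleFilter implementation; the function names `shingles` and `and_match` and the sample titles are hypothetical, chosen only for illustration.

```python
from typing import Iterable, List, Set


def shingles(tokens: List[str], max_size: int = 2) -> List[str]:
    """Return single tokens plus word n-grams (shingles) of up to max_size words."""
    out = []
    for i in range(len(tokens)):
        for n in range(1, max_size + 1):
            if i + n <= len(tokens):
                out.append(" ".join(tokens[i:i + n]))
    return out


def and_match(query_tokens: Iterable[str], docs: List[Set[str]]) -> int:
    """Count documents that contain every query token (AND semantics)."""
    wanted = set(query_tokens)
    return sum(1 for doc in docs if wanted <= doc)


# Hypothetical product titles, lowercased and whitespace-tokenized.
titles = ["harry potter", "bird spotter guide"]
docs = [set(title.split()) for title in titles]

# "harry" and "spotter" both exist in the index, but never in the same
# document, so a collation built word by word finds nothing under AND.
print(and_match(["harry", "spotter"], docs))
print(and_match(["harry", "potter"], docs))

# A dictionary built from shingles keeps the word pairs that really occur,
# so a phrase-level lookup can tell the two candidates apart.
dictionary = {s for title in titles for s in shingles(title.split())}
print("harry potter" in dictionary)
print("harry spotter" in dictionary)
```

Whether a shingled "spell" field behaves this way in Solr depends on the analyzer chain and on the QueryConverter keeping the phrase together at query time, as Grant notes above; the sketch only shows the idea.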