Mailing-List: contact solr-user-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: solr-user@lucene.apache.org
Received-SPF: pass (athena.apache.org: local policy)
Subject: Re: How to tokenize/analyze docs for the spellchecker - at
	indexing and query time
From: Martin Grotzke <martin.grotzke@javakaffee.de>
To: solr-user@lucene.apache.org
In-Reply-To: <EA98829C-DEB2-410B-85D7-35DFCE8E8349@apache.org>
References: <1222859479.4197.52.camel@localhost.localdomain.tld>
	 <75c31b2a0810031321xa20d910vf6902e4d39aadfc7@mail.gmail.com>
	 <1223279505.3836.17.camel@localhost.localdomain.tld>
	 <EA98829C-DEB2-410B-85D7-35DFCE8E8349@apache.org>
Content-Type: multipart/signed; micalg=pgp-sha1;
 protocol="application/pgp-signature"; boundary="=-pvlzeNpkDcMhkNRvr7lF"
Date: Mon, 06 Oct 2008 22:45:38 +0200
Message-Id: <1223325938.4802.26.camel@localhost.localdomain.tld>
Mime-Version: 1.0

--=-pvlzeNpkDcMhkNRvr7lF
Content-Type: text/plain
Content-Transfer-Encoding: quoted-printable

On Mon, 2008-10-06 at 09:00 -0400, Grant Ingersoll wrote:
> On Oct 6, 2008, at 3:51 AM, Martin Grotzke wrote:
>
> > Hi Jason,
> >
> > what about multi-word searches like "harry potter"? When I do a search
> > in our index for "harry poter", I get the suggestion "harry
> > spotter" (using spellcheck.collate=true and jarowinkler distance).
> > Searching for "harry spotter" (we're searching AND, not OR) then gives
> > no results. I assume that this is because suggestions are done for
> > words separately, and this does not require that both/all suggestions
> > are contained in the same document.
> >
>
> Yeah, the SpellCheckComponent is not phrase aware.  My guess would be
> that you would somehow need a QueryConverter (see
> http://wiki.apache.org/solr/SpellCheckComponent)
> that preserved phrases as a single token.  Likewise, you would need
> that on your indexing side as well for the spell checker.  In short, I
> suppose it's possible, but it would be work.  You probably could use
> the shingle filter (token based n-grams).

I also thought about something like this, and also stumbled over the
ShingleFilter :)

So I would change the "spell" field to use the ShingleFilter?
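
As a rough, untested sketch of what that could look like in schema.xml: the type name "textSpellShingle" and the shingle size of 2 are placeholders made up for illustration, and the tokenizer and lowercase filter are simply taken from the textSpell type quoted further down.

```xml
<!-- Hypothetical field type: like textSpell, plus word-level n-grams (shingles). -->
<!-- The name "textSpellShingle" and maxShingleSize="2" are assumptions. -->
<fieldType name="textSpellShingle" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Emits "harry", "potter" and "harry potter" for the input "Harry Potter" -->
    <filter class="solr.ShingleFilterFactory"
            maxShingleSize="2" outputUnigrams="true"/>
  </analyzer>
</fieldType>

<!-- The "spell" field would then use this type instead of "string". -->
<field name="spell" type="textSpellShingle" indexed="true" stored="false"
       multiValued="true"/>
```

This only covers the indexing side. As Grant notes above, the query side would still need a QueryConverter that keeps the phrase together as a single token.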

Did I understand the answer to the posting "chaining copyFields"
correctly, that I cannot pipe the title through some "shingledTitle"
field and copy it afterwards to the "spell" field (while other fields
like brand are copied directly to the spell field)?
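
If that reading is right (copyField takes the raw source value, and one copy does not feed into another), every source field would have to be copied to "spell" directly, roughly as in the sketch below. The field names "title", "shingledTitle", "brand" and "spell" are the ones from this thread; the exact declarations are an assumption.

```xml
<!-- Sketch only: each copyField reads the original (unanalyzed) source value, -->
<!-- so "shingledTitle" cannot act as an intermediate step on the way to "spell". -->
<copyField source="title" dest="shingledTitle"/>
<copyField source="title" dest="spell"/>
<copyField source="brand" dest="spell"/>
```

With this layout all values arriving in "spell" go through the analyzer of the "spell" field type, so the title could not be shingled separately from the brand.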

Thanx && cheers,
Martin


>
> Alternatively, by using extendedResults, you can get back the
> frequency of each of the words, and then you could decide whether the
> collation is going to have any results assuming they are all or'd
> together.  For phrases and AND queries, I'm not sure.  It's doable,
> I'm sure, but it would be a lot more involved.
>
>
> > I wonder what's the standard approach for searches with multiple
> > words.
> > Are these working ok for you?
> >
> > Cheers,
> > Martin
> >
> > On Fri, 2008-10-03 at 16:21 -0400, Jason Rennie wrote:
> >> Hi Martin,
> >>
> >> I'm a relative newbie to solr, have been playing with the spellcheck
> >> component and seem to have it working.  I certainly can't explain
> >> what all is going on, but with any luck, I can help you get the
> >> spellchecker up-and-running.  Additional replies in-lined below.
> >>
> >> On Wed, Oct 1, 2008 at 7:11 AM, Martin Grotzke
> >> <martin.grotzke@javakaffee.de> wrote:
> >>
> >>> Now I'm thinking about the source-field in the spellchecker ("spell"):
> >>> how should fields be analyzed during indexing, and how should the
> >>> queryAnalyzerFieldType be configured.
> >>
> >>
> >> I followed the conventions in the default solrconfig.xml and schema.xml
> >> files.  So I created a "textSpell" field type (schema.xml):
> >>
> >>    <!-- field type for the spell checker which doesn't stem -->
> >>    <fieldtype name="textSpell" class="solr.TextField"
> >>               positionIncrementGap="100">
> >>      <analyzer>
> >>        <tokenizer class="solr.StandardTokenizerFactory"/>
> >>        <filter class="solr.LowerCaseFilterFactory"/>
> >>        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
> >>      </analyzer>
> >>    </fieldtype>
> >>
> >> and used this for the queryAnalyzerFieldType.  I also created a spellField
> >> to store the text I want to spell check against and used the same analyzer
> >> (figuring that the query and indexed data should be analyzed the same way)
> >> (schema.xml):
> >>
> >>   <!-- Spell check field -->
> >>   <field name="spellField" type="textSpell" indexed="true"
> >>          stored="true" />
> >>
> >>
> >>
> >>> If I have brands like e.g. "Apple" or "Ed Hardy" I would copy them (the
> >>> field "brand") directly to the "spell" field. The "spell" field is of
> >>> type "string".
> >>
> >>
> >> We're copying description to spellField.  I'd recommend using a type like
> >> the above textSpell type since "The StringField type is not analyzed, but
> >> indexed/stored verbatim" (schema.xml):
> >>
> >>  <copyField source="description" dest="spellField" />
> >>
> >>> Other fields like e.g. the product title I would first copy to some
> >>> whitespaceTokenized field (field type with WhitespaceTokenizerFactory)
> >>> and afterwards to the "spell" field. The product title might be e.g.
> >>> "Canon EOS 450D EF-S 18-55 mm".
> >>
> >>
> >> Hmm... I'm not sure if this would work as I don't think the analyzer is
> >> applied until after the copy is made.  FWIW, I've had trouble copying
> >> multiple fields to spellField (i.e. adding a second copyField w/
> >> dest="spellField"), so we just index the spellchecker on a single
> >> field...
> >>
> >>> Shouldn't this be a WhitespaceTokenizerFactory, or is it better to use a
> >>> StandardTokenizerFactory here?
> >>
> >>
> >> I think if you use the same analyzer for indexing and queries, the
> >> distinction probably isn't tremendously important.  When I went searching,
> >> it looked like the StandardTokenizer split on non-letters.  I'd guess the
> >> rationale for using the StandardTokenizer is that it won't recommend
> >> non-letter characters.  I was seeing some weirdness earlier (no
> >> inserts/deletes), but that disappeared now that I'm using the
> >> StandardTokenizer.
> >>
> >> Cheers,
> >>
> >> Jason
> > --
> > Martin Grotzke
> > http://www.javakaffee.de/blog/
>
> --------------------------
> Grant Ingersoll
>
> Lucene Helpful Hints:
> http://wiki.apache.org/lucene-java/BasicsOfPerformance
> http://wiki.apache.org/lucene-java/LuceneFAQ

--=-pvlzeNpkDcMhkNRvr7lF
Content-Type: application/pgp-signature; name=signature.asc
Content-Description: This is a digitally signed message part

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)

iEYEABECAAYFAkjqeO0ACgkQ7FvOl7Te+pYXXQCfcARFKcF0kQsf/Sf5ANKj5WJd
luQAn2VD2V7OUlmrnSilAK8/KQ77RkK2
=jf29
-----END PGP SIGNATURE-----

--=-pvlzeNpkDcMhkNRvr7lF--