Message-ID: <34cc3b0a05062812372d08a050@mail.gmail.com>
Date: Tue, 28 Jun 2005 15:37:20 -0400
From: Chris D <brogar@gmail.com>
Reply-To: Chris D <brogar@gmail.com>
To: java-user@lucene.apache.org
Subject: Re: Indexing puncutation
In-Reply-To: <14FBF41EF1411B45B2EC4ADEAC53D1310342B6FE@MAIL01.wescodist.com>
References: <14FBF41EF1411B45B2EC4ADEAC53D1310342B6FE@MAIL01.wescodist.com>

On 6/28/05, Aigner, Thomas <TAigner@wescodist.com> wrote:
> Thanks for the info Chris.
>
> I thought I'd provide some more information.  One problem is that the
> descriptions are not consistently formatted. In other words, a description
> doesn't follow a fixed set of rules (num num - alpha alpha etc).  They
> are literally anything a supplier has put in for them.
>
> The example below (21-MA-GAB) is stored differently by these analyzers:
>
> WhitespaceAnalyzer:     [21-MA-GAB]
> SimpleAnalyzer:         [ma] [gab]
> StopAnalyzer:           [ma] [gab]
> StandardAnalyzer:       [21-ma] [gab]
> SynonymAnalyzer:        [21-ma] [gab]
>       (One I created for synonyms, much like the standard one)
> SnowballAnalyzer:       [21-ma] [gab]
>
> My problem is that searching for 21magab returns nothing, and neither
> does 21ma* etc.
>
> This is just one of my punctuation problems. There can be "" for inches
> and 1/2 items etc.
>
> I am currently using my SynonymAnalyzer for some aliases to build the
> index and the SnowballAnalyzer to query the index (nice stemming in it).
>
> Tom

You can write your own analyzer to do the tokenizing, so that
21-MA-GAB ends up stored as [21magab] in the index, assuming the codes
are mostly formatted the same way. That's what I was suggesting, not
switching to a different stock analyzer. (If the codes aren't formatted
the same way, it becomes more difficult.)
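
A minimal sketch of that normalization, in plain Java rather than
against the Lucene API. The class and method names are hypothetical;
in a real custom analyzer this logic would sit in the tokenizer or a
token filter, and the same normalization would have to be applied to
the query text as well.

```java
final class CodeNormalizer {
    // Hypothetical helper: collapse a part code into a single lowercase
    // term by dropping every character that is not a letter or digit.
    static String normalize(String code) {
        StringBuilder sb = new StringBuilder(code.length());
        for (int i = 0; i < code.length(); i++) {
            char c = code.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                sb.append(Character.toLowerCase(c));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalize("21-MA-GAB")); // prints 21magab
    }
}
```

With [21magab] as the stored term, a search for 21magab matches
exactly, and a prefix such as 21ma is a prefix of the stored term.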

The other problems you're describing with the descriptions could also
be solved with a proper analyzer. Add a "FRACTION" token type to the
lexical grammar, and don't strip punctuation like the inch quote. Or
map "1/2" to "half" as a synonym, I guess (I haven't done much work
with synonyms).
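
To illustrate the FRACTION idea, here is a rough sketch using
java.util.regex instead of a real lexical grammar. The token type
names (FRACTION, NUM, WORD, INCH) and the class are assumptions made
up for this example, not anything Lucene defines.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DescriptionTokenizer {
    // FRACTION is listed before NUM so that "1/2" is matched whole
    // instead of being split into two numbers.
    private static final Pattern TOKEN = Pattern.compile(
        "(?<FRACTION>\\d+/\\d+)|(?<NUM>\\d+)|(?<WORD>\\p{L}+)|(?<INCH>\")");

    // Returns tokens as "TYPE:text", lowercased; anything the pattern
    // does not match (spaces, hyphens, ...) is skipped.
    static List<String> tokenize(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            String type = m.group("FRACTION") != null ? "FRACTION"
                        : m.group("NUM") != null ? "NUM"
                        : m.group("WORD") != null ? "WORD"
                        : "INCH";
            out.add(type + ":" + m.group().toLowerCase());
        }
        return out;
    }

    public static void main(String[] args) {
        // A made-up description with a fraction and an inch mark.
        System.out.println(tokenize("1/2\" Copper Elbow"));
    }
}
```

The fraction and the inch mark survive as their own tokens here, so
they can be searched on or mapped to synonyms later.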

Lastly, and someone should correct me if I'm wrong, but you should
always use the same analyzer to create and to query the index.
Otherwise queries that should return hits won't. For instance, with a
synonym analyzer at index time, the following

   The canoeist paddles

could be indexed as [boater] [strokes]. A stemming analyzer at query
time would then parse the query

   contents:paddles

to [paddle], which likely would not get the hit you expect.
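
The mismatch can be reproduced with two toy analyzers in plain Java.
Both are crude stand-ins written for this example (a two-entry
synonym map and a "strip a trailing s" stemmer), not Lucene's actual
analyzers.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class AnalyzerMismatchDemo {
    private static final Map<String, String> SYNONYMS = new HashMap<>();
    static {
        SYNONYMS.put("canoeist", "boater");
        SYNONYMS.put("paddles", "strokes");
    }

    // Toy synonym analyzer: lowercase, drop "the", swap in synonyms.
    static List<String> synonymAnalyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String word : text.toLowerCase().split("\\s+")) {
            if (word.isEmpty() || word.equals("the")) continue;
            terms.add(SYNONYMS.getOrDefault(word, word));
        }
        return terms;
    }

    // Toy stemming analyzer: lowercase, drop "the", strip a trailing "s".
    static List<String> stemAnalyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String word : text.toLowerCase().split("\\s+")) {
            if (word.isEmpty() || word.equals("the")) continue;
            terms.add(word.endsWith("s") ? word.substring(0, word.length() - 1) : word);
        }
        return terms;
    }

    // A document "hits" when every query term is among its indexed terms.
    static boolean hits(List<String> indexed, List<String> query) {
        return indexed.containsAll(query);
    }

    public static void main(String[] args) {
        List<String> indexed = synonymAnalyze("The canoeist paddles");
        System.out.println(indexed);                                  // [boater, strokes]
        System.out.println(hits(indexed, stemAnalyze("paddles")));    // false
        System.out.println(hits(indexed, synonymAnalyze("paddles"))); // true
    }
}
```

Using the same analyzer on both sides produces the same term for
"paddles" at index time and at query time, which is why the second
lookup succeeds and the first does not.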

Cheers,
Chris
