lucene-java-user mailing list archives

From Chris D <bro...@gmail.com>
Subject Re: Indexing punctuation
Date Tue, 28 Jun 2005 19:37:20 GMT
On 6/28/05, Aigner, Thomas <TAigner@wescodist.com> wrote:
> Thanks for the info Chris.
> 
> I thought I'd provide some more information.  One problem is that the
> descriptions are not consistently formatted. In other words, a description
> doesn't follow a fixed set of rules (num num - alpha alpha etc.).  They
> are literally anything a supplier has put in for them.
> 
> The example below (21-MA-GAB) is stored differently by these analyzers:
> 
> WhitespaceAnalyzer:     [21-MA-GAB]
> SimpleAnalyzer:         [ma] [gab]
> StopAnalyzer:           [ma] [gab]
> StandardAnalyzer:       [21-ma] [gab]
> SynonymAnalyzer:        [21-ma] [gab]
>       (One I created for synonyms; much like the standard one)
> SnowballAnalyzer:       [21-ma] [gab]
> 
> My problem is that searching for 21magab returns nothing, and neither
> does 21ma* etc.
> 
> This is just one of my punctuation problems; there can also be " for
> inches, items like 1/2, etc.
> 
> I am currently using my SynonymAnalyzer (for some aliases) to build the
> index and the SnowballAnalyzer to query it (it has nice stemming).
> 
> Tom

You can write an analyzer that does the tokenizing so that 21-MA-GAB
ends up stored as [21magab] in the index, assuming the codes are
mostly formatted the same way. That's what I was suggesting: write your
own analyzer rather than switch to a different existing one. (If the
codes aren't formatted the same way, it becomes more difficult.)
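
A minimal sketch of that normalization step, in plain Java rather than
against the Lucene API. The class and method names are hypothetical, and
it assumes a code should collapse to just its lowercased letters and
digits; in a real analyzer this logic would sit inside a custom token
filter.

```java
// Hypothetical helper, not part of Lucene: collapses a supplier code
// such as "21-MA-GAB" into a single searchable term.
final class PartCodeNormalizer {

    // Keep only letters and digits, lowercased; drop hyphens, spaces,
    // and any other punctuation.
    static String normalize(String code) {
        StringBuilder sb = new StringBuilder(code.length());
        for (int i = 0; i < code.length(); i++) {
            char c = code.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                sb.append(Character.toLowerCase(c));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalize("21-MA-GAB")); // prints 21magab
    }
}
```

With the term stored this way, both the exact query 21magab and the
prefix query 21ma* have something to match, provided the query text is
normalized the same way.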

The other problems you're describing with the descriptions could also
be solved with a proper analyzer. Add a "FRACTION" token type to the
lexical grammar, and don't strip punctuation like the quote. Or map
"1/2" to "half" as a synonym, I guess (I haven't done much work with
synonyms).
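
To make the "FRACTION" idea concrete, here is a rough sketch using a
plain java.util.regex pattern instead of a real Lucene grammar. The
token type names (FRACTION, INCH, WORD) and the class name are
assumptions made for illustration, not anything defined by Lucene.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical tokenizer sketch: keeps fractions and the inch mark as
// tokens of their own instead of stripping the punctuation.
final class DescriptionTokenizer {

    // Alternatives are tried left to right, so the fraction rule must
    // come before the plain word rule or "1/2" would split into 1 and 2.
    private static final Pattern TOKEN = Pattern.compile(
            "(?<FRACTION>\\d+/\\d+)|(?<INCH>\")|(?<WORD>[A-Za-z0-9]+)");

    // Returns tokens as "TYPE:text" strings.
    static List<String> tokenize(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            if (m.group("FRACTION") != null) {
                out.add("FRACTION:" + m.group());
            } else if (m.group("INCH") != null) {
                out.add("INCH:\"");
            } else {
                out.add("WORD:" + m.group().toLowerCase(Locale.ROOT));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("1/2\" Copper Pipe"));
    }
}
```

A synonym step could then map the FRACTION token 1/2 to "half" if that
turns out to be more useful for searching.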

Lastly, and someone should correct me if I'm wrong, but you should
always use the same analyzer to build and to query the index.
Otherwise queries that should return hits won't. Take the following
for instance:

   The canoeist paddles

could be indexed as [boater] [strokes] by a synonym analyzer, while the
query

   contents:paddles

would be parsed to [paddle] by a stemming analyzer, and likely would not
get the hit you expect.
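
The mismatch can be reproduced with two toy analyzers in plain Java.
Everything here is made up for illustration (the synonym table, the
crude "strip a trailing s" stemmer, the stop word list); none of it is
Lucene's actual SynonymAnalyzer or SnowballAnalyzer behaviour.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Toy demonstration of index-time and query-time analyzers disagreeing.
final class AnalyzerMismatchDemo {

    // Shared first step: lowercase, split on whitespace, drop "the".
    private static List<String> words(String text) {
        List<String> out = new ArrayList<>();
        for (String w : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!w.isEmpty() && !w.equals("the")) {
                out.add(w);
            }
        }
        return out;
    }

    // Toy synonym analyzer: rewrites two words to their aliases.
    static List<String> synonymAnalyze(String text) {
        List<String> out = new ArrayList<>();
        for (String w : words(text)) {
            if (w.equals("canoeist")) {
                out.add("boater");
            } else if (w.equals("paddles")) {
                out.add("strokes");
            } else {
                out.add(w);
            }
        }
        return out;
    }

    // Toy stemming analyzer: strips one trailing "s" from longer words.
    static List<String> stemAnalyze(String text) {
        List<String> out = new ArrayList<>();
        for (String w : words(text)) {
            boolean plural = w.length() > 3 && w.endsWith("s");
            out.add(plural ? w.substring(0, w.length() - 1) : w);
        }
        return out;
    }

    // A query hits if any of its terms is among the indexed terms.
    static boolean hits(List<String> indexedTerms, List<String> queryTerms) {
        for (String q : queryTerms) {
            if (indexedTerms.contains(q)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> indexed = synonymAnalyze("The canoeist paddles");
        System.out.println(hits(indexed, stemAnalyze("paddles")));    // false
        System.out.println(hits(indexed, synonymAnalyze("paddles"))); // true
    }
}
```

Indexing and querying with the stemming analyzer on both sides also
produces a hit; what matters is that both sides go through the same
analysis.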

Cheers,
Chris
