lucene-java-user mailing list archives

From "Steven A Rowe" <sar...@syr.edu>
Subject RE: Unicode Tokenizer problem with Registered Trademark Search
Date Wed, 02 Apr 2008 21:49:19 GMT
Hi Bruce,

On 04/02/2008 at 4:58 PM, Bruce.Nawrocki@gxs.com wrote:
> I am having a problem when searching for certain Unicode
> characters, such as the Registered Trademark. That's the
> Unicode character 00AE. It's also a problem searching for a
> Japanese Yen symbol (Unicode character 00A5).
> 
> I'm using the Lucene 2.0.0 jar file, and we used to use
> Lucene 1.4.2 jar file, where this used to work OK. But Lucene
> 2.0.0 doesn't work the same way.

I don't see anything that would have caused such a change - below is a colored side-by-side
diff of StandardTokenizer.jj at revisions 150560 and 409716, corresponding to the lucene_1_4_2
and lucene_2_0_0 tags, respectively:

<http://svn.apache.org/viewvc/lucene/java/tags/lucene_2_0_0/src/java/org/apache/lucene/analysis/standard/StandardTokenizer.jj?r1=150560&r2=409716&diff_format=h>

(Note that the JavaCC-targeted StandardTokenizer.jj was replaced in release 2.3.0 by the JFlex-targeted
StandardTokenizerImpl.jflex for performance reasons - see <http://issues.apache.org/jira/browse/LUCENE-966>.)

> I see that the registered trademark is in the Lucene index
> file, so that's good. The problem comes when I try to search
> for these characters.
>
> I see that my query starts off OK, as this:
> 
> ( (Locale:en) AND ( productName:(Digital¥^95) ) )    (if you
> cannot see the Japanese Yen symbol, it comes directly after "Digital")
> 
> Note: the "^95" is just a boost factor, and is OK.
> 
> I'm using StandardAnalyzer and StandardTokenizer to create a
> new QueryParser, and after I call the "parse" method of the
> QueryParser, my query becomes this:
> 
>  +Locale:en +productName:digital^95.0
> 
> Notice that the Japanese Yen symbol is gone! I think it's
> because the StandardTokenizer.jj file doesn't handle this
> character, and so it throws it away.
> 
> Is there any way to use a different Analyzer and/or
> Tokenizer, rather than building my own?
> 
> And if I had created my Lucene indexes with the
> StandardAnalyzer, must I use the StandardAnalyzer and
> StandardTokenizer to search the index?

In order for the Yen and Registered Trademark symbols to appear in the index, you must have
used a different analyzer for indexing than the one you're using for querying.  This can lead
to problems, as you have discovered.

The short answer is: you should use the same analyzer.

The longer answer is that you should use "compatible" analyzers.  "Compatibility" means that
the terms produced by the query-time analyzer have corresponding index terms.  Of course,
this condition is satisfied by using the same analyzer at both index- and query-time.  An
example of compatible but different analyzers is a pair in which only one side (index-time
or query-time) injects synonyms.
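
As a rough illustration of the compatibility condition, here is a minimal sketch in plain Java.
It does not use Lucene at all: the class name and the two toy analyzers are hypothetical
stand-ins, assumed only to mimic one analyzer that drops symbols such as the Yen sign and one
that keeps them. It checks whether every query-time term has a corresponding index term.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch, not Lucene code: two toy analyzers and a compatibility check.
class AnalyzerCompatibilitySketch {

    // Toy symbol-dropping analyzer (assumption): keeps runs of letters/digits, lowercased.
    static List<String> symbolDroppingAnalyze(String text) {
        List<String> terms = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                // Any other character (space, U+00A5, U+00AE, ...) ends the term.
                terms.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            terms.add(current.toString());
        }
        return terms;
    }

    // Toy symbol-keeping analyzer (assumption): splits on whitespace only, lowercased.
    static List<String> symbolKeepingAnalyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String piece : text.trim().split("\\s+")) {
            if (!piece.isEmpty()) {
                terms.add(piece.toLowerCase(Locale.ROOT));
            }
        }
        return terms;
    }

    // True when every query-time term has a corresponding index term.
    static boolean queryTermsAreIndexed(List<String> indexTerms, List<String> queryTerms) {
        return new HashSet<>(indexTerms).containsAll(queryTerms);
    }

    public static void main(String[] args) {
        String productName = "Digital\u00A5 Camera"; // U+00A5 is the Yen sign
        String query = "Digital\u00A5";

        // Mismatch: symbols kept at index time, dropped at query time.
        System.out.println(queryTermsAreIndexed(
                symbolKeepingAnalyze(productName), symbolDroppingAnalyze(query)));

        // Same analyzer on both sides: the query term is found.
        System.out.println(queryTermsAreIndexed(
                symbolDroppingAnalyze(productName), symbolDroppingAnalyze(query)));
    }
}
```

In this toy setup the first check fails because the index holds "digital¥" while the query
produces "digital"; the second succeeds because both sides produce "digital".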

I don't know why you weren't seeing this problem with Lucene 1.4.2, but is it possible that
the 1.4.2-created index did *not* have these two symbols?  If that were true, then you would
get the hits you're looking for, though you might get some others that you don't want.
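
To make that hypothesis concrete, here is a small sketch in plain Java (again, not Lucene code).
The class name, the toy analyzer, the in-memory postings map, and the three sample product names
are all assumptions for illustration. It shows what happens if symbols are stripped at both
index and query time: the query still finds the document that contained the Yen sign, but it
also finds other documents that merely contain "Digital".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical sketch: a toy inverted index whose analyzer strips symbols on both sides.
class SymbolFreeIndexSketch {

    // Toy analyzer (assumption): keeps runs of letters/digits, lowercased; drops symbols.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                terms.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            terms.add(current.toString());
        }
        return terms;
    }

    // Maps each term to the sorted set of document ids that contain it.
    static Map<String, TreeSet<Integer>> buildIndex(List<String> docs) {
        Map<String, TreeSet<Integer>> postings = new HashMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            for (String term : analyze(docs.get(docId))) {
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
            }
        }
        return postings;
    }

    // Returns ids of documents containing every term produced from the query.
    static TreeSet<Integer> search(Map<String, TreeSet<Integer>> postings, String query) {
        TreeSet<Integer> hits = null;
        for (String term : analyze(query)) {
            TreeSet<Integer> docs = postings.getOrDefault(term, new TreeSet<>());
            if (hits == null) {
                hits = new TreeSet<>(docs);
            } else {
                hits.retainAll(docs);
            }
        }
        return hits == null ? new TreeSet<>() : hits;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "Digital\u00A5 Converter", // doc 0: the one actually wanted
                "Digital Camera",          // doc 1: an extra, unwanted hit
                "Analog Radio");           // doc 2: no match
        Map<String, TreeSet<Integer>> postings = buildIndex(docs);
        System.out.println(search(postings, "Digital\u00A5")); // prints [0, 1]
    }
}
```

Because the Yen sign never reaches the index or the query in this sketch, the search cannot
distinguish document 0 from document 1.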

Steve
