lucene-java-user mailing list archives

From <Bruce.Nawro...@gxs.com>
Subject Unicode Tokenizer problem with Registered Trademark Search
Date Wed, 02 Apr 2008 20:58:38 GMT

I am having a problem searching for certain Unicode characters, such as the Registered
Trademark sign (Unicode character 00AE). The same problem occurs when searching for the
Japanese Yen symbol (Unicode character 00A5).

I'm using the Lucene 2.0.0 jar file. We previously used the Lucene 1.4.2 jar file, where
these searches worked OK, but Lucene 2.0.0 doesn't behave the same way.

I see that the registered trademark is in the Lucene index file, so that's good. The problem
comes when I try to search for these characters.

I see that my query starts off OK, as this:

( (Locale:en) AND ( productName:(Digital¥^95) ) )    (if you cannot see the Japanese Yen
symbol, it comes directly after "Digital")

Note: the "^95" is just a boost factor, and is OK.

I'm using StandardAnalyzer and StandardTokenizer to create a new QueryParser, and after I
call the "parse" method of the QueryParser, my query becomes this:

 +Locale:en +productName:digital^95.0

Notice that the Japanese Yen symbol is gone! I think it's because the grammar in the
StandardTokenizer.jj file doesn't handle this character, and so the tokenizer throws it away.
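
To illustrate what I suspect is happening, here is a minimal stand-alone sketch in plain Java. It does not use any Lucene classes and is not the actual StandardTokenizer grammar; the class and method names (SymbolTokenDemo, letterDigitTokens, whitespaceTokens) are hypothetical and exist only for this example. It assumes a tokenizer that keeps only letters and digits, and shows that both U+00A5 and U+00AE fall outside that set, while a whitespace-only split would keep them attached to the word.

```java
import java.util.ArrayList;
import java.util.List;

public class SymbolTokenDemo {

    // Toy tokenizer (assumption, NOT Lucene's grammar): keeps runs of
    // letters/digits, lowercases them, and silently drops everything else.
    static List<String> letterDigitTokens(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    // Toy alternative: split on whitespace only, so symbols stay in the token.
    static List<String> whitespaceTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.trim().split("\\s+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String yen = "Digital\u00A5";   // "Digital" followed by the Yen symbol
        String reg = "Digital\u00AE";   // "Digital" followed by the Registered sign

        // Neither symbol is a letter or digit in Unicode terms.
        System.out.println(Character.getType('\u00A5') == Character.CURRENCY_SYMBOL); // true
        System.out.println(Character.getType('\u00AE') == Character.OTHER_SYMBOL);    // true

        System.out.println(letterDigitTokens(yen)); // [digital]
        System.out.println(letterDigitTokens(reg)); // [digital]
        System.out.println(whitespaceTokens(yen).get(0).length()); // 8, symbol kept
    }
}
```

If the real tokenizer behaves like the letter/digit version here, that would explain why "digital" is the only term left after parsing.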

Is there any way to use a different Analyzer and/or Tokenizer, rather than building my own?

And if my Lucene indexes were created with the StandardAnalyzer, must I also use the
StandardAnalyzer and StandardTokenizer to search them?

Thanks.
