lucene-java-user mailing list archives

From Robert Muir <rcm...@gmail.com>
Subject Re: Lucene and Chinese language
Date Thu, 01 Jul 2010 10:33:33 GMT
This is a bug in the queryparser
(https://issues.apache.org/jira/browse/LUCENE-2458).

The problem has nothing to do with your choice of analyzer; it has to do
with how the query is formed.

Currently the queryparser uses a convoluted algorithm involving whitespace
(and not just the double quote operator, as you would expect) to form phrase
queries. So queries like this one, with no whitespace, always form phrase
queries.
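
To make the behavior above concrete, here is a toy model in plain Java. It is not Lucene code: the class and method names are made up for illustration, and it assumes the analyzer emits one token per Chinese character (which matches the observation in the original question) and that the query carries no operators or wildcards.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only, NOT Lucene source: approximates how the query parser
// described above turns raw query text into clauses.
final class WhitespacePhraseModel {

    // Assumption: the analyzer emits one token per character.
    static List<String> analyze(String chunk) {
        List<String> tokens = new ArrayList<String>();
        for (int i = 0; i < chunk.length(); ) {
            int cp = chunk.codePointAt(i);
            tokens.add(new String(Character.toChars(cp)));
            i += Character.charCount(cp);
        }
        return tokens;
    }

    // The parser splits on whitespace first, then analyzes each chunk.
    // A chunk that yields more than one token becomes a phrase, quoted or not.
    static List<String> parse(String query) {
        List<String> clauses = new ArrayList<String>();
        for (String chunk : query.trim().split("\\s+")) {
            if (chunk.isEmpty()) {
                continue;
            }
            List<String> tokens = analyze(chunk);
            if (tokens.size() == 1) {
                clauses.add("Term(" + tokens.get(0) + ")");
            } else {
                clauses.add("Phrase(" + String.join(" ", tokens) + ")");
            }
        }
        return clauses;
    }

    public static void main(String[] args) {
        System.out.println(parse("佛山东方书城"));
        System.out.println(parse("佛 山 东 方 书 城"));
    }
}
```

In this model the unspaced query comes out as a single six-term phrase clause, while the spaced query comes out as six independent term clauses, which is consistent with only the spaced form returning results.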

The only workaround that gives reasonably good results consists of two steps:
1. At query time (only!), add an
org.apache.lucene.analysis.position.PositionFilter (from contrib/analyzers)
to your analyzer. Don't do this at index time, just at query time!
2. This makes all terms in the query "synonyms" of each other, which bypasses
the problem but screws up scoring, so you might also want to extend
QueryParser in a custom way:

@Override
 protected BooleanQuery newBooleanQuery(boolean disableCoord) {
   // Intentionally ignore disableCoord: keep the coord() factor enabled
   // to compensate for the PositionFilter hack.
   return new BooleanQuery(false);
 }
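
To show why the override matters for scoring, here is a small plain-Java sketch of the arithmetic. It is a toy model, not Lucene code: the class and method names are hypothetical, the scores are made-up numbers, and it assumes the default behavior of PositionFilter (first token keeps its position increment, later tokens get 0) and the usual coord formula of matched clauses divided by total clauses.

```java
// Toy model only, NOT Lucene source.
final class CoordModel {

    // Assumed PositionFilter behavior with its default setting: the first
    // token keeps its position increment, every later token gets 0.
    static int[] stackPositions(int[] increments) {
        int[] out = increments.clone();
        for (int i = 1; i < out.length; i++) {
            out[i] = 0;
        }
        return out;
    }

    // Number of positions the tokens span. A count of 1 means the tokens
    // look like synonyms, so they are ORed together instead of phrased.
    static int positionCount(int[] increments) {
        int count = 0;
        for (int inc : increments) {
            count += inc;
        }
        return count;
    }

    // Assumed coord formula: matched clauses / total clauses.
    static float coord(int overlap, int maxOverlap) {
        return overlap / (float) maxOverlap;
    }

    // summedScore is the (made-up) sum of the matching clauses' scores.
    static float score(float summedScore, int overlap, int maxOverlap,
                       boolean disableCoord) {
        return disableCoord ? summedScore
                            : summedScore * coord(overlap, maxOverlap);
    }

    public static void main(String[] args) {
        int[] sixChars = {1, 1, 1, 1, 1, 1};
        System.out.println(positionCount(sixChars));
        System.out.println(positionCount(stackPositions(sixChars)));
        System.out.println(score(2.0f, 3, 6, true));
        System.out.println(score(2.0f, 3, 6, false));
    }
}
```

With coord disabled, a document matching only three of the six characters keeps its full summed score; with coord enabled, that score is halved, so documents matching more of the characters rank higher. That is the effect the override is after.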

2010/7/1 Kolhoff, Jacqueline - ENCOWAY <Kolhoff@encoway.de>

>
> Hi!
>
> We are using lucene in our project to search through information objects
> which works fine. For indexing we use the StandardAnalyzer.
> Now, we have to support the Chinese language. I found out that the Chinese
> words and letters are correctly saved in the index but the query to search
> for them does not work. Example: in English the query is “text”,
> which we parse to “*text*”. If we search for Chinese words / phrases like
> “佛山东方书城”, the query is “*佛山东方书城*”, but there are no search
> results. If the query places blanks between the single letters / symbols
> like this: “*佛 山 东 方 书 城*”, we are getting results. Does the
> StandardAnalyzer interpret each
> Chinese letter as one word? What are best practices for this case? Shall we
> use another analyzer (Chinese analyzer)? Or is it better to replace the
> query parser in this case?
>
> Regards,
> Jacqueline.
>



-- 
Robert Muir
rcmuir@gmail.com
