lucene-java-user mailing list archives

From "Kolhoff, Jacqueline - ENCOWAY" <Kolh...@encoway.de>
Subject Re: Lucene and Chinese language
Date Thu, 01 Jul 2010 12:27:16 GMT
Hm, that doesn't work for me. 

Maybe we misunderstood each other:

In my query there aren't any double quotes (I just added them to show that this is my query).

I search in a field for a special value (substring) like 在电力虎

My query looks like this:

+anotherfieldname:description +myfieldname:*在电力虎*

We always add the multi-character wildcard (*) before and after the query string.
The query string is 在电力虎

With the first approach

QueryParser queryParser = new QueryParser(Version.LUCENE_30, "myfieldname"
 , new StandardAnalyzer(Version.LUCENE_30));

we get no results.

If we do not add the *, it works with StandardAnalyzer for Chinese. The parsed query is:
+anotherfieldname:description +myfieldname:"在 电 力 虎"

As you can see, the query parser automatically added double quotes and blanks. But omitting
the * does not work for our English or German queries.

If I use the PositionHackAnalyzerWrapper and keep the *, I get no results. The parsed query is:
+anotherfieldname:description +myfieldname:*在电力虎*

If I remove the *, the parsed query is:
+anotherfieldname:description +(myfieldname:在 myfieldname:电 myfieldname:力 myfieldname:虎)

and I get results, but not for German or English queries.

Weird?
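
One way to reconcile the two cases described above is to choose the query form per input: keep the *...* wildcard form for English or German text, and send Chinese text as space-separated single characters in a phrase, which is the form that returned results with StandardAnalyzer. The following is only a rough sketch using plain Java. The class FieldQueryBuilder and its methods containsHan and build are hypothetical names, not part of Lucene. It assumes the input is either entirely Chinese or entirely non-Chinese, that leading wildcards are enabled on the parser, and it does not escape query-syntax characters.

```java
// Hypothetical helper (not a Lucene class): builds the query string for one field.
final class FieldQueryBuilder {
    private FieldQueryBuilder() {}

    /** Returns true if any code point of the text belongs to the Han script. */
    static boolean containsHan(String text) {
        return text.codePoints()
                   .anyMatch(cp -> Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN);
    }

    /**
     * Assumption: input without Han characters is searched as a wildcard
     * substring; input with Han characters is searched as a phrase of
     * single characters, one token per character.
     */
    static String build(String field, String userInput) {
        String text = userInput.trim();
        if (!containsHan(text)) {
            // English/German case: keep the existing *text* wildcard form.
            return "+" + field + ":*" + text + "*";
        }
        // Chinese case: one token per code point, separated by blanks.
        StringBuilder phrase = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (Character.isWhitespace(cp)) {
                return; // existing blanks are dropped and re-inserted uniformly
            }
            if (phrase.length() > 0) {
                phrase.append(' ');
            }
            phrase.appendCodePoint(cp);
        });
        return "+" + field + ":\"" + phrase + "\"";
    }
}
```

With this sketch, FieldQueryBuilder.build("myfieldname", "在电力虎") yields +myfieldname:"在 电 力 虎" and FieldQueryBuilder.build("myfieldname", "text") yields +myfieldname:*text*. Note that the phrase form matches the characters as adjacent tokens in order, which is not identical to a true substring match in every case.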


-----Original Message-----
From: Robert Muir [mailto:rcmuir@gmail.com]
Sent: Thursday, 1 July 2010 13:51
To: java-user@lucene.apache.org
Subject: Re: Lucene and Chinese language

You can make your own analyzer, or do something like the following at
query time.

QueryParser queryParser = new QueryParser(Version.LUCENE_30, "myfieldname" ,
new PositionHackAnalyzerWrapper(new StandardAnalyzer(Version.LUCENE_30)));

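import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.position.PositionFilter; // from contrib/analyzers

// Query-time only: wraps another analyzer and puts every token after the
// first at the same position, so the parser treats the terms as "synonyms".
public class PositionHackAnalyzerWrapper extends Analyzer {
  private final Analyzer wrapped;

  public PositionHackAnalyzerWrapper(Analyzer wrapped) {
    this.wrapped = wrapped;
  }

  @Override
  public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream ts = wrapped.tokenStream(fieldName, reader);
    return new PositionFilter(ts);
  }
}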

2010/7/1 Kolhoff, Jacqueline - ENCOWAY <Kolhoff@encoway.de>

> How can I add this PositionFilter? I can't see anything in the API. I use
> lucene version 3.0.1, this is my query parser:
>
> QueryParser queryParser = new QueryParser(Version.LUCENE_30, "myfieldname"
> , new StandardAnalyzer(Version.LUCENE_30));
>
> -----Original Message-----
> From: Robert Muir [mailto:rcmuir@gmail.com]
> Sent: Thursday, 1 July 2010 12:34
> To: java-user@lucene.apache.org
> Subject: Re: Lucene and Chinese language
>
> This is a bug in the queryparser. (
> https://issues.apache.org/jira/browse/LUCENE-2458)
>
> the problem has nothing to do with your choice of analyzer, it has to do
> with how the query is formed.
>
> Currently the queryparser uses a convoluted algorithm involving whitespace
> (and not just the double quote operator as you would expect) to form phrase
> queries. So, queries like this with no whitespace form phrase queries
> always.
>
> The only workaround for reasonably good results consists of two steps:
> 1. at query time (only!) add a
> org.apache.lucene.analysis.position.PositionFilter (from contrib/analyzers)
> to your analyzer. don't do this at index-time, just query-time!
> 2. this will make all terms in the query "synonyms" of each other to bypass
> this problem, but will screw up scoring, so you might want to also extend
> QueryParser in a custom way:
>
> @Override
> protected BooleanQuery newBooleanQuery(boolean disableCoord) {
>   // Intentionally ignore the disableCoord argument: the PositionFilter hack
>   // makes the parser ask for coord() to be disabled, so keep it enabled here.
>   return new BooleanQuery(false);
> }
>
> 2010/7/1 Kolhoff, Jacqueline - ENCOWAY <Kolhoff@encoway.de>
>
> >
> > Hi!
> >
> > We are using lucene in our project to search through information objects
> > which works fine. For indexing we use the StandardAnalyzer.
> > Now, we have to support the Chinese language. I found out that the Chinese
> > words and letters are correctly saved in the index but the query to search
> > for them does not work. Example: in English language the query is “text”
> > which we parse to “*text*”. If we search for Chinese words / phrases like
> > “佛山东方书城” the query is “*佛山东方书城*” but there are no search results.
> > If the query places blanks between the single letters / symbols like this
> > “*佛 山 东 方 书 城*” we are getting results. Does the StandardAnalyzer
> > interpret each Chinese letter as one word? What are best practices for this
> > case? Shall we use another analyzer (Chinese analyzer)? Or is it better to
> > replace the query parser in this case?
> >
> > Regards,
> > Jacqueline.
> >
>
>
>
> --
> Robert Muir
> rcmuir@gmail.com
>



-- 
Robert Muir
rcmuir@gmail.com