lucene-java-user mailing list archives

From "Allison, Timothy B." <talli...@mitre.org>
Subject RE: QueryParser
Date Mon, 24 Mar 2014 16:24:45 GMT
To expand on Herb's comment, in Lucene, the StandardAnalyzer will break CJK into characters:


1 : 轻
2 : 歌
3 : 曼
4 : 舞
5 : 庆
6 : 元
7 : 旦

If you initialize the classic QueryParser with StandardAnalyzer, the parser will use that
Analyzer to break this string into individual characters as above.  From a linguistic standpoint,
this is unnerving, but from a retrieval perspective, this should work fairly well as long
as you are also doing some kind of normalization (ICU or CJKWidthFilter).  As Herb mentioned,
you might consider experimenting with smartcn to try to tokenize on actual words; as an example,
the SmartChineseAnalyzer breaks the string into:

1 : 轻歌曼舞
2 : 庆
3 : 元旦

In Solr, if you use the default "text_cjk" field type, you'll get the bigram behavior Herb
describes below because of CJKBigramFilterFactory. If you don't want bigrams, consider removing
that filter; or if you want both bigrams and unigrams, consider adding outputUnigrams="true", as in:

<filter class="solr.CJKBigramFilterFactory" outputUnigrams="true"/>


-----Original Message-----
From: Herb Roitblat [mailto:herb.roitblat@orcatec.com] 
Sent: Monday, March 24, 2014 9:01 AM
To: java-user@lucene.apache.org; kalaiselvan.k@zohocorp.com
Subject: Re: QueryParser

The default query parser for CJK languages breaks text into bigrams.  A 
word consisting of characters ABCDE is broken into tokens  AB, BC, CD, 
DE, or

"轻歌曼舞庆元旦"

into
data:轻歌 data:歌曼 data:曼舞 data:舞庆 data:庆元 data:元旦
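
To make the two token shapes in this thread concrete, here is a minimal sketch in plain Java. It is not Lucene's analyzer code; the class name `CjkNgramSketch` and both method names are made up for illustration, and emitting a lone character as its own token is an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: mimics the token shapes discussed in this thread,
// not Lucene's actual analyzer code. All names here are hypothetical.
class CjkNgramSketch {

    // One token per Unicode code point (per-character tokenization).
    static List<String> unigrams(String text) {
        List<String> tokens = new ArrayList<>();
        text.codePoints()
            .forEach(cp -> tokens.add(new String(Character.toChars(cp))));
        return tokens;
    }

    // Overlapping pairs: ABCDE -> AB, BC, CD, DE.
    // Assumption: a lone character is emitted as-is so it stays searchable.
    static List<String> bigrams(String text) {
        List<String> chars = unigrams(text);
        List<String> tokens = new ArrayList<>();
        if (chars.size() == 1) {
            tokens.add(chars.get(0));
            return tokens;
        }
        for (int i = 0; i + 1 < chars.size(); i++) {
            tokens.add(chars.get(i) + chars.get(i + 1));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints [轻歌, 歌曼, 曼舞, 舞庆, 庆元, 元旦]
        System.out.println(bigrams("轻歌曼舞庆元旦"));
    }
}
```

The sketch works on code points rather than `char` values so that characters outside the Basic Multilingual Plane are not split in half.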

Each pair may or may not be a word, but if you use the same parser (i.e. 
analyzer) for indexing and for searching, you should get reasonable 
results.  A more powerful parser, typically one that includes a 
dictionary, is available, and may give analyses closer to what you 
expect, at the cost of being slower.
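
Why the same analyzer matters on both sides can be shown with a toy example. This is a sketch under simplifying assumptions, not how Lucene's index or QueryParser works internally: the "index" is just the set of tokens for one document, a query matches when all of its tokens are in that set, and the names `SameAnalyzerSketch` and `matches` are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Toy model: a document "index" is the set of tokens its analyzer produced.
// All names are hypothetical; this is not Lucene code.
class SameAnalyzerSketch {

    // One token per Unicode code point.
    static List<String> unigrams(String text) {
        List<String> tokens = new ArrayList<>();
        text.codePoints()
            .forEach(cp -> tokens.add(new String(Character.toChars(cp))));
        return tokens;
    }

    // Overlapping two-character tokens; a lone character is kept as-is.
    static List<String> bigrams(String text) {
        List<String> chars = unigrams(text);
        List<String> tokens = new ArrayList<>();
        if (chars.size() == 1) {
            tokens.add(chars.get(0));
        }
        for (int i = 0; i + 1 < chars.size(); i++) {
            tokens.add(chars.get(i) + chars.get(i + 1));
        }
        return tokens;
    }

    // True when every query token occurs among the document's indexed tokens.
    static boolean matches(String doc, Function<String, List<String>> indexAnalyzer,
                           String query, Function<String, List<String>> queryAnalyzer) {
        Set<String> indexed = new HashSet<>(indexAnalyzer.apply(doc));
        return indexed.containsAll(queryAnalyzer.apply(query));
    }

    public static void main(String[] args) {
        String doc = "轻歌曼舞庆元旦";
        // Same analyzer on both sides: the query bigram 元旦 is in the index.
        System.out.println(matches(doc, SameAnalyzerSketch::bigrams,
                                   "元旦", SameAnalyzerSketch::bigrams));  // true
        // Mismatched analyzers: the index holds bigrams, the query asks for 元 and 旦.
        System.out.println(matches(doc, SameAnalyzerSketch::bigrams,
                                   "元旦", SameAnalyzerSketch::unigrams)); // false
    }
}
```

In the mismatched case the document clearly contains the query text, yet nothing matches, because the index holds no single-character tokens.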

Look here, for example: 
http://lucene.apache.org/core/4_0_0/analyzers-common/index.html
and here: http://lucene.apache.org/core/4_0_0/analyzers-smartcn/index.html



On 3/23/2014 11:21 PM, kalaik wrote:
> Dear Team,
>
> Any update?
>
> ---- On Fri, 21 Mar 2014 14:40:51 +0530 kalaik <kalaiselvan.k@zohocorp.com> wrote ----
>
> Dear Team,
>
> We are using Lucene in our product. It works well for fast, high-performance
> search, but Japanese, Chinese, and Korean text is not searched properly. We
> use QueryParser.
>
> QueryParser splits text such as "轻歌曼舞庆元旦" into separate tokens.
>
> Example:
>
>     Input string: "轻歌曼舞庆元旦"
>     Split tokens: data:轻歌 data:歌曼 data:曼舞 data:舞庆 data:庆元 data:元旦
>
> Here is my code:
>
>     Query query = parser.parse(searchData);
>     logger.log(Level.INFO, "Search Query is calling {0}", query);
>     TopDocs docs = is.search(query, resultRowSize);
>
>
> If you need any clarification, please get back to me. Please help as soon as possible.
>
>
> Regards,
> kalai..

