lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From karl wettin <karl.wet...@gmail.com>
Subject Re: Lucene for chinese search
Date Mon, 18 Jun 2007 20:00:30 GMT
A year or two ago I hacked Lucene to use UTF16 instead of UTF8 as CJK  
characters are represented by 3 bytes with UTF8, and 2 bytes as  
UTF16. It is a simple hack.

It did however not save me that much as I had a mixed latin and CJK  
corpus, and I reverted. Still think it is something worth  
considering. Perhaps it might be worth implementing per index, per  
document or per field string encoding strategy.




18 jun 2007 kl. 20.01 skrev Chris Lu:

> Basically where ever you see, the encoding should be utf8.
>
> The servlet also has an encoding setting. For your case, change the
> tomcat setting.
> When rendering jsp page, the encoding also matters.
>
> -- 
> Chris Lu
> -------------------------
> Instant Scalable Full-Text Search On Any Database/Application
> site: http://www.dbsight.net
> demo: http://search.dbsight.com
> Lucene Database Search in 3 minutes:
> http://wiki.dbsight.com/index.php? 
> title=Create_Lucene_Database_Search_in_3_minutes
>
> On 6/18/07, Lee Li Bin <leelb@xedge.com.sg> wrote:
>>
>> Hi,
>>
>> For indexing, there is no problem, there is Chinese text similar  
>> to my
>> datasource (XML) in the index file when opening on a note pad.
>>
>> When I try to use the utf8 in jsp and, getbytes array of 'utf-8' or
>> ISO88599_1 or Cp1252 in Java servlet, but we getting search  
>> problem, the
>> search result does not display for Chinese term.
>>
>> I mixed English and Chinese text in my datasource, the search is  
>> working for
>> English term, and Chinese char display as '???' in the result output.
>>
>> Please advice or send some sample / solutions
>>
>> Thanks.
>>
>> -----Original Message-----
>> From: Mathieu Lecarme [mailto:mathieu@garambrogne.net]
>> Sent: Monday, June 18, 2007 8:58 PM
>> To: java-user@lucene.apache.org
>> Subject: Re: Lucene for chinese search
>>
>> Lee Li Bin a écrit :
>> > Hi,
>> >
>> > I still met problem for searching of Chinese words.
>> > XMl file which is the datasource and analyzer has already been  
>> encoded.
>> > Have testing on StandardAnalyzer, CJKAnalyzer, and  
>> ChineseAnalyzer, but it
>> > still can't get any results.
>> >
>> > 1.    do we need any encoding configuration in apache tomcat for  
>> Chinese
>> > search using Lucence
>> >
>> > 2.    do we need to use JSP meta / page encoding ? what is the  
>> encoding
>> > for   jsp?
>> >
>> try first with simple junit test, after you can fight with UTF8  
>> parameters.
>>
>> M.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message