lucene-java-user mailing list archives

From karl wettin <karl.wet...@gmail.com>
Subject Re: Lucene for chinese search
Date Mon, 18 Jun 2007 20:42:06 GMT
Don't they differ in tokenization? As far as I know, one of them uses
n-grams and the other does not. That would be another thing that might
mess it up. But I have never looked at the highlighter, so I can only
guess.
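
To make the tokenization point concrete, here is a minimal sketch in
plain Java. It is not Lucene code: the class and method names are made
up for illustration, and it assumes one analyzer emits one token per
character while the other emits overlapping two-character grams. It
also ignores characters outside the Basic Multilingual Plane.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of Lucene.
final class CjkTokenSketch {

    // One token per character (unigrams).
    static List<String> unigrams(String text) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            out.add(text.substring(i, i + 1));
        }
        return out;
    }

    // Overlapping two-character tokens (bigrams).
    // A single-character input is returned as one token.
    static List<String> bigrams(String text) {
        List<String> out = new ArrayList<>();
        if (text.length() == 1) {
            out.add(text);
        }
        for (int i = 0; i + 1 < text.length(); i++) {
            out.add(text.substring(i, i + 2));
        }
        return out;
    }

    public static void main(String[] args) {
        // Four CJK characters, written as escapes so the source stays ASCII.
        String text = "\u4e2d\u6587\u641c\u7d22";
        System.out.println(unigrams(text)); // 4 tokens of 1 character each
        System.out.println(bigrams(text));  // 3 tokens of 2 characters each
    }
}
```

The two token streams have different lengths and different token
boundaries, so anything that relies on token positions or offsets
could plausibly behave differently between the two.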

--
karl

On 18 Jun 2007, at 22:37, Chris Lu wrote:

> Hi, Karl,
>
> Thanks for sharing this experience.
>
> I did find that CJKAnalyzer behaves differently from
> ChineseAnalyzer. When I tried to highlight the matched term,
> ChineseAnalyzer did not work, but I did not investigate why.
>
> This is a useful clue for it.
>
> -- 
> Chris Lu
> -------------------------
> Instant Scalable Full-Text Search On Any Database/Application
> site: http://www.dbsight.net
> demo: http://search.dbsight.com
> Lucene Database Search in 3 minutes:
> http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
>
>
> On 6/18/07, karl wettin <karl.wettin@gmail.com> wrote:
>> A year or two ago I hacked Lucene to use UTF16 instead of UTF8 as CJK
>> characters are represented by 3 bytes with UTF8, and 2 bytes as
>> UTF16. It is a simple hack.
>>
>> However, it did not save me that much, as I had a mixed Latin and
>> CJK corpus, so I reverted. I still think it is worth considering.
>> Perhaps it would be worth implementing a per-index, per-document or
>> per-field string encoding strategy.
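
A small sketch of the size trade-off described above, in plain Java.
The class and method names are hypothetical, and this only measures
encoded string lengths; it says nothing about how Lucene actually
stores terms.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not part of Lucene.
final class EncodingSizeSketch {

    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // UTF_16BE is used because StandardCharsets.UTF_16 prepends a
    // 2-byte byte-order mark when encoding, which would skew the count.
    static int utf16Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String cjk = "\u4e2d\u6587";          // two CJK characters
        String latin = "Lucene";              // six ASCII characters
        String mixed = latin + " " + cjk;

        System.out.println(utf8Bytes(cjk) + " vs " + utf16Bytes(cjk));     // 6 vs 4
        System.out.println(utf8Bytes(latin) + " vs " + utf16Bytes(latin)); // 6 vs 12
        System.out.println(utf8Bytes(mixed) + " vs " + utf16Bytes(mixed)); // 13 vs 18
    }
}
```

UTF-16 saves a byte per CJK character but costs an extra byte per
ASCII character, which is consistent with the small gain reported for
a mixed corpus.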
>>
>> On 18 Jun 2007, at 20:01, Chris Lu wrote:
>>
>> > Basically, wherever an encoding can be set, it should be UTF-8.
>> >
>> > The servlet also has an encoding setting. For your case, change the
>> > Tomcat setting.
>> > When rendering the JSP page, the encoding also matters.
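
A minimal sketch of what a charset mismatch does to Chinese text, in
plain Java with no servlet or Lucene dependency. The class and method
names are hypothetical. It assumes the two common failure modes:
encoding with a charset that has no CJK characters, and decoding UTF-8
bytes as ISO-8859-1.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only.
final class MojibakeSketch {

    // ISO-8859-1 cannot represent CJK characters, so getBytes
    // substitutes '?' for each one and the original text is lost.
    static String lossyRoundTrip(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    // Simulates a layer that reads UTF-8 bytes but assumes ISO-8859-1.
    static String misdecode(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    // Undoes misdecode: ISO-8859-1 maps every byte value to one char,
    // so the original UTF-8 bytes can be recovered and decoded again.
    static String repair(String garbled) {
        return new String(garbled.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String query = "\u4e2d\u6587"; // two CJK characters
        System.out.println(lossyRoundTrip(query));                  // ??
        System.out.println(repair(misdecode(query)).equals(query)); // true
    }
}
```

The first case is unrecoverable, which is why a consistent UTF-8
setting at every layer is preferable to converting byte arrays after
the fact.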
>> >
>> > --
>> > Chris Lu
>> >
>> > On 6/18/07, Lee Li Bin <leelb@xedge.com.sg> wrote:
>> >>
>> >> Hi,
>> >>
>> >> Indexing works without problems: when I open the index file in
>> >> Notepad, it contains Chinese text matching my datasource (XML).
>> >>
>> >> I tried using UTF-8 in the JSP, and getBytes with 'utf-8',
>> >> ISO8859_1 or Cp1252 in the Java servlet, but we still have a
>> >> search problem: no search results are displayed for Chinese terms.
>> >>
>> >> My datasource mixes English and Chinese text. Searching works for
>> >> English terms, but Chinese characters display as '???' in the
>> >> result output.
>> >>
>> >> Please advise, or send some samples or solutions.
>> >>
>> >> Thanks.
>> >>
>> >> -----Original Message-----
>> >> From: Mathieu Lecarme [mailto:mathieu@garambrogne.net]
>> >> Sent: Monday, June 18, 2007 8:58 PM
>> >> To: java-user@lucene.apache.org
>> >> Subject: Re: Lucene for chinese search
>> >>
>> >> Lee Li Bin wrote:
>> >> > Hi,
>> >> >
>> >> > I still have a problem searching for Chinese words.
>> >> > The XML file, which is the datasource, and the analyzer have
>> >> > already been encoded. I have tested StandardAnalyzer,
>> >> > CJKAnalyzer, and ChineseAnalyzer, but I still can't get any
>> >> > results.
>> >> >
>> >> > 1.    Do we need any encoding configuration in Apache Tomcat
>> >> > for Chinese search using Lucene?
>> >> >
>> >> > 2.    Do we need to use a JSP meta / page encoding? What is the
>> >> > encoding for JSP?
>> >> >
>> >> Try a simple JUnit test first; after that you can fight with the
>> >> UTF-8 parameters.
>> >>
>> >> M.
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> >> For additional commands, e-mail: java-user-help@lucene.apache.org