lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Chris Lu" <chris...@gmail.com>
Subject Re: Lucene for chinese search
Date Mon, 18 Jun 2007 20:37:04 GMT
Hi, Karl,

Thanks for sharing this experience.

I did find CJKAnalyzer somehow behaves differently than
ChineseAnalyzer. When trying to highlight the matched term,
ChineseAnalyzer didn't work somehow. But I didn't investigate into it.

This is a useful clue for it.

-- 
Chris Lu
-------------------------
Instant Scalable Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com
Lucene Database Search in 3 minutes:
http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes


On 6/18/07, karl wettin <karl.wettin@gmail.com> wrote:
> A year or two ago I hacked Lucene to use UTF16 instead of UTF8 as CJK
> characters are represented by 3 bytes with UTF8, and 2 bytes as
> UTF16. It is a simple hack.
>
> It did however not save me that much as I had a mixed latin and CJK
> corpus, and I reverted. Still think it is something worth
> considering. Perhaps it might be worth implementing per index, per
> document or per field string encoding strategy.
>
>
>
>
> 18 jun 2007 kl. 20.01 skrev Chris Lu:
>
> > Basically where ever you see, the encoding should be utf8.
> >
> > The servlet also has an encoding setting. For your case, change the
> > tomcat setting.
> > When rendering jsp page, the encoding also matters.
> >
> > --
> > Chris Lu
> > -------------------------
> > Instant Scalable Full-Text Search On Any Database/Application
> > site: http://www.dbsight.net
> > demo: http://search.dbsight.com
> > Lucene Database Search in 3 minutes:
> > http://wiki.dbsight.com/index.php?
> > title=Create_Lucene_Database_Search_in_3_minutes
> >
> > On 6/18/07, Lee Li Bin <leelb@xedge.com.sg> wrote:
> >>
> >> Hi,
> >>
> >> For indexing, there is no problem, there is Chinese text similar
> >> to my
> >> datasource (XML) in the index file when opening on a note pad.
> >>
> >> When I try to use the utf8 in jsp and, getbytes array of 'utf-8' or
> >> ISO88599_1 or Cp1252 in Java servlet, but we getting search
> >> problem, the
> >> search result does not display for Chinese term.
> >>
> >> I mixed English and Chinese text in my datasource, the search is
> >> working for
> >> English term, and Chinese char display as '???' in the result output.
> >>
> >> Please advice or send some sample / solutions
> >>
> >> Thanks.
> >>
> >> -----Original Message-----
> >> From: Mathieu Lecarme [mailto:mathieu@garambrogne.net]
> >> Sent: Monday, June 18, 2007 8:58 PM
> >> To: java-user@lucene.apache.org
> >> Subject: Re: Lucene for chinese search
> >>
> >> Lee Li Bin a écrit :
> >> > Hi,
> >> >
> >> > I still met problem for searching of Chinese words.
> >> > XMl file which is the datasource and analyzer has already been
> >> encoded.
> >> > Have testing on StandardAnalyzer, CJKAnalyzer, and
> >> ChineseAnalyzer, but it
> >> > still can't get any results.
> >> >
> >> > 1.    do we need any encoding configuration in apache tomcat for
> >> Chinese
> >> > search using Lucence
> >> >
> >> > 2.    do we need to use JSP meta / page encoding ? what is the
> >> encoding
> >> > for   jsp?
> >> >
> >> try first with simple junit test, after you can fight with UTF8
> >> parameters.
> >>
> >> M.
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: java-user-help@lucene.apache.org
> >>
> >>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: java-user-help@lucene.apache.org
> >>
> >>
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message