From: "Lee Li Bin"
Reply-To: java-user@lucene.apache.org
Subject: RE: Lucene for chinese search
Date: Tue, 19 Jun 2007 16:16:10 +0800
Message-Id: <200706190825.l5J8OrdJ012526@ns21.webhostsg.com>
In-Reply-To: <6e3ae6310706181337j4eef156asbb2f3a39bcbc0a74@mail.gmail.com>

Hi,

Thanks, guys, for helping me. I forgot to use the same analyzer for
searching as for indexing, which is why I couldn't search for Chinese
words. :)

-----Original Message-----
From: Chris Lu [mailto:chris.lu@gmail.com]
Sent: Tuesday, June 19, 2007 4:37 AM
To: java-user@lucene.apache.org
Subject: Re: Lucene for chinese search

Hi, Karl,

Thanks for sharing this experience.

I did find that CJKAnalyzer behaves differently from ChineseAnalyzer:
when I tried to highlight the matched term, ChineseAnalyzer didn't work.
I didn't investigate it, but this is a useful clue.

--
Chris Lu
-------------------------
Instant Scalable Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com
Lucene Database Search in 3 minutes:
http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes

On 6/18/07, karl wettin wrote:
> A year or two ago I hacked Lucene to use UTF-16 instead of UTF-8, as
> CJK characters are represented by 3 bytes in UTF-8 and 2 bytes in
> UTF-16. It is a simple hack.
>
> It did not, however, save me that much, as I had a mixed Latin and CJK
> corpus, and I reverted. I still think it is something worth
> considering. Perhaps it might be worth implementing a per-index,
> per-document or per-field string encoding strategy.
>
> On 18 Jun 2007, at 20:01, Chris Lu wrote:
>
> > Basically, wherever you see an encoding setting, it should be UTF-8.
> >
> > The servlet also has an encoding setting. For your case, change the
> > Tomcat setting.
> > When rendering a JSP page, the encoding also matters.
> >
> > --
> > Chris Lu
> >
> > On 6/18/07, Lee Li Bin wrote:
> >>
> >> Hi,
> >>
> >> Indexing is not a problem: when I open the index file in Notepad,
> >> it contains Chinese text matching my data source (XML).
> >>
> >> I tried using UTF-8 in the JSP, and getBytes with 'utf-8',
> >> ISO8859_1 or Cp1252 in the Java servlet, but searching still
> >> fails: no results are displayed for Chinese terms.
> >>
> >> My data source mixes English and Chinese text. Searching works
> >> for English terms, but Chinese characters are displayed as '???'
> >> in the result output.
> >>
> >> Please advise, or send some samples or solutions.
> >>
> >> Thanks.
> >>
> >> -----Original Message-----
> >> From: Mathieu Lecarme [mailto:mathieu@garambrogne.net]
> >> Sent: Monday, June 18, 2007 8:58 PM
> >> To: java-user@lucene.apache.org
> >> Subject: Re: Lucene for chinese search
> >>
> >> Lee Li Bin wrote:
> >> > Hi,
> >> >
> >> > I still have a problem searching for Chinese words.
> >> > The XML file that serves as the data source and the analyzer
> >> > have already been encoded.
> >> > I have tested StandardAnalyzer, CJKAnalyzer, and
> >> > ChineseAnalyzer, but I still can't get any results.
> >> >
> >> > 1. Do we need any encoding configuration in Apache Tomcat for
> >> >    Chinese search using Lucene?
> >> >
> >> > 2. Do we need to use a JSP meta / page encoding? What is the
> >> >    encoding for JSP?
> >> >
> >> Try first with a simple JUnit test; after that you can fight with
> >> the UTF-8 parameters.
> >>
> >> M.
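
The '???' output and the getBytes experiments described in the quoted
messages can be reproduced in isolation. The following is only a minimal
sketch using the Java standard library, not code from this thread: the
class and method names are hypothetical, and it assumes the container
decoded UTF-8 request bytes as ISO-8859-1, which may or may not have been
the case in the original setup.

```java
import java.nio.charset.StandardCharsets;

/** Stdlib-only illustration; class and method names are hypothetical. */
final class ChineseEncodingCheck {

    /** Encodes text with a charset that cannot represent CJK, then decodes it again. */
    static String throughLatin1(String text) {
        // Characters that ISO-8859-1 cannot map are replaced by '?'.
        byte[] bytes = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    /** Simulates a container that decodes UTF-8 request bytes as ISO-8859-1. */
    static String misdecoded(String text) {
        return new String(text.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    /** Recovers a value that was decoded as ISO-8859-1 although its bytes were UTF-8. */
    static String repair(String garbled) {
        return new String(garbled.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Two Chinese characters, written as escapes so the source file encoding does not matter.
        String term = "\u4e2d\u6587";
        System.out.println(throughLatin1(term));                              // the data is lost
        System.out.println(repair(misdecoded(term)).equals(term));            // the data is recoverable
        System.out.println(term.getBytes(StandardCharsets.UTF_8).length);     // byte count in UTF-8
        System.out.println(term.getBytes(StandardCharsets.UTF_16BE).length);  // byte count in UTF-16, no BOM
    }
}
```

A check of this kind, run outside Tomcat, separates encoding problems from
analyzer problems before any servlet or JSP setting is changed.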

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
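
The fix reported at the top of this message (using the same analyzer at
search time as at index time) can be illustrated without Lucene itself.
The sketch below is a toy model, not Lucene code: the class and method
names are hypothetical, and the two tokenizers only approximate how
ChineseAnalyzer (one token per character) and CJKAnalyzer (overlapping
two-character tokens) split Chinese text.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;

/** Toy model of an index/query tokenizer mismatch; all names are hypothetical. */
final class AnalyzerMismatchSketch {

    /** One token per character, roughly what ChineseAnalyzer does for Chinese text. */
    static List<String> unigrams(String text) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            tokens.add(text.substring(i, i + 1));
        }
        return tokens;
    }

    /** Overlapping two-character tokens, roughly what CJKAnalyzer does for CJK text. */
    static List<String> bigrams(String text) {
        List<String> tokens = new ArrayList<>();
        if (text.length() == 1) {
            tokens.add(text); // a lone character is kept as its own token
        }
        for (int i = 0; i + 1 < text.length(); i++) {
            tokens.add(text.substring(i, i + 2));
        }
        return tokens;
    }

    /** True if the query has tokens and every one of them occurs among the indexed tokens. */
    static boolean matches(List<String> indexed, List<String> query) {
        return !query.isEmpty() && new HashSet<>(indexed).containsAll(query);
    }

    public static void main(String[] args) {
        String document = "\u4e2d\u6587\u641c\u7d22"; // four Chinese characters
        String query = "\u4e2d\u6587";                // the first two of them

        List<String> index = bigrams(document);       // indexed with the bigram tokenizer

        // Same tokenizer on both sides: the query token exists in the index.
        System.out.println(matches(index, bigrams(query)));
        // Different tokenizer at search time: single characters were never indexed.
        System.out.println(matches(index, unigrams(query)));
    }
}
```

In this toy model the second lookup fails even though the query text is
contained in the document, which is the same symptom as an empty result
list for Chinese terms.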