Subject: Re: Question wrt Lucene analyzer for different language
From: Robert Muir
To: java-user@lucene.apache.org
Date: Thu, 14 May 2009 12:22:19 -0400

I would say in general, yes. When I say "change Arabic text", I mean the
Arabic analyzer will standardize and stem Arabic words, but it won't modify
any of your English words.

And no, there is no case in Arabic. This is why, if you are handling mixed
Arabic/English text, I recommend creating a custom analyzer that does some
basics with the English part as well, such as LowerCaseFilter.

On Thu, May 14, 2009 at 11:26 AM, weidong sun wrote:

> Thanks for the quick answer. :-)
>
> So can I say that ArabicAnalyzer can generally tokenize mixed content
> with Arabic and English? :-)
>
> I am not really familiar with the Arabic language. What do you mean by
> "change Arabic tokens"? Does Arabic have something like upper/lower case
> as English does?
>
> On Thu, May 14, 2009 at 10:47 AM, Robert Muir wrote:
>
> > In the case of ArabicAnalyzer, it will only change Arabic tokens and will
> > leave English words as-is (it will not convert them to lowercase or
> > anything like that).
> >
> > So if you want good Arabic and English behavior, you would want to
> > create a custom analyzer that looks like the Arabic analyzer but also
> > invokes LowerCaseFilter, perhaps also some English stemmer, etc.
> >
> > On Thu, May 14, 2009 at 10:11 AM, weidong sun wrote:
> >
> > > Hello,
> > >
> > > I am a newbie in the Lucene world. I might ask some obvious questions
> > > to which, unfortunately, I don't know the answer. Please help me
> > > 'grow'.
> > >
> > > We have a project that intends to use the Lucene search engine to
> > > search user info stored in our system. The user info might not be in
> > > English, although it will be stored in UTF-8 encoding.
> > >
> > > My question is: if I use one particular Lucene analyzer for a language
> > > other than English (e.g. ChineseAnalyzer or ArabicAnalyzer), can it
> > > still handle the text correctly if the user info is mixed with English
> > > characters/words?
> > >
> > > Any answers are really appreciated.
> > >
> > > :-)
> >
> > --
> > Robert Muir
> > rcmuir@gmail.com

--
Robert Muir
rcmuir@gmail.com
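To illustrate the point made in this thread that Arabic has no case, here is a minimal plain-Java sketch showing that a lowercasing step only affects the English tokens of mixed text. It is not Lucene code: the class name and the `lowercaseTokens` helper are hypothetical, and the whitespace split is only a stand-in for a real tokenizer. A custom analyzer as suggested above would presumably do the equivalent with LowerCaseFilter on top of the Arabic-specific processing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration class; not part of Lucene.
class MixedTextLowercaseSketch {

    // Splits on whitespace (a stand-in for a real tokenizer) and
    // lowercases every token. Arabic letters have no case mapping,
    // so Arabic tokens come back unchanged.
    static List<String> lowercaseTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // "Lucene", the Arabic word for "book" (written with Unicode
        // escapes), and "Search".
        String mixed = "Lucene \u0643\u062A\u0627\u0628 Search";
        for (String token : lowercaseTokens(mixed)) {
            System.out.println(token);
        }
    }
}
```

Only "Lucene" and "Search" are changed by the lowercasing step; the Arabic token passes through as-is, which is why the English side of mixed content needs its own handling in the analyzer chain.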