Message-ID: <425BBB3F.9010907@colaborativa.net>
Date: Tue, 12 Apr 2005 09:12:47 -0300
From: Ernesto De Santis
Organization: Colaborativa.net
To: java-user@lucene.apache.org
Subject: Re: Multi-analyzer ?
References: <200504111002.01179.mail@andy-roberts.net> <200504111613.53674.mail@andy-roberts.net>

Maybe you can use PerFieldAnalyzerWrapper. (I have never used it myself.)

Ernesto.

Eric Chow wrote:

>But what if one document contains more than two different languages?
>
>Eric
>
>On Apr 12, 2005 12:13 AM, Andy Roberts wrote:
>
>>On Monday 11 Apr 2005 14:55, Mike Baranczak wrote:
>>
>>>Your example with Arabic wouldn't work reliably either - there are
>>>several other languages that use the Arabic script (Persian for
>>>example).
>>
>>Good point, although you could try a simple approach: test for the
>>additional characters that exist in Persian but not in Arabic. Again,
>>this is not fool-proof. A letter-model approach would be better but is
>>rather time-consuming.
>>
>>>This is the sort of problem that the end user can solve much better
>>>than the software can.
>>
>>I completely agree, which is why I originally suggested prompting the
>>user for this info. It may be the case that for the majority of queries,
>>English is the usual language. And it is probably more feasible to do a
>>test to determine whether the query is English or not (still very tricky,
>>mind). If not, then prompt the user to specify their input language,
>>because otherwise results will be poor.
>>
>>Andy Roberts
>>
>>>-MB
>>>
>>>On Apr 11, 2005, at 6:02 AM, Andy Roberts wrote:
>>>
>>>>Can you not provide the user with an option list to specify their
>>>>input language?
>>>>
>>>>Language identification can be a pretty tricky field.
>>>>There are some tricks you can do with Unicode to identify language,
>>>>e.g., \u0600 - \u06FF contains the Arabic characters, so if your input
>>>>contains lots of chars within this range, you can guess that the input
>>>>is Arabic, for example.
>>>>
>>>>The problem comes with differentiating between the languages that use
>>>>a Latin alphabet. Again, there are multiple approaches, although the
>>>>only one I know of that worked pretty well for identifying European
>>>>languages was to build a model based on character bigrams (that is,
>>>>sequences of two letters) [1].
>>>>
>>>>At the end of the day, Lucene cannot help you in choosing the correct
>>>>language as it doesn't know, and so it'll be up to you to add the
>>>>necessary logic to tell Lucene which Analyzers to utilise. :(
>>>>
>>>>Andy
>>>>
>>>>[1] Churcher, G. E.; Hayes, J.; Hughes, J. S.; Johnson, S.; Souter, C.
>>>>Bigram and trigram models for language identification and
>>>>classification. In: Evett, L. & Rose, T. (editors), Computational
>>>>Linguistics for Speech and Handwriting Recognition, AISB'94 Workshop,
>>>>University of Leeds/AISB, 1994.
>>>>
>>>>On Monday 11 Apr 2005 01:21, Eric Chow wrote:
>>>>
>>>>>Hello,
>>>>>
>>>>>If I don't know the language of the input terms, how can I use
>>>>>a different analyzer to search it?
>>>>>
>>>>>For example, the input box accepts UTF-8 search text, which can be
>>>>>anything, such as Chinese, Japanese, English, Russian, Deuch, etc. How
>>>>>can I search any of them or all of them with Lucene?
>>>>>
>>>>>Any example, please?
>>>>>
>>>>>Best Regards,
>>>>>Eric
>>>>>
>>>>>---------------------------------------------------------------------
>>>>>To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>>>For additional commands, e-mail: java-user-help@lucene.apache.org

-- 
Ernesto De Santis - Colaborativa.net
Córdoba 1147 Piso 6 Oficinas 3 y 4
(S2000AWO) Rosario, SF, Argentina.
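
The Unicode-range trick quoted in this thread (counting characters that fall in \u0600 - \u06FF) could be sketched in plain Java roughly as follows. This is only an illustration, not code from anyone in the thread: the class name ScriptGuesser and both method names are made up for this example, and it relies only on java.lang.Character from the standard library, with no Lucene calls. Note that it guesses a script, not a language, so Persian input (for example U+067E) is still reported as Arabic, which is exactly the limitation Mike points out.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical helper, not part of Lucene: guesses the script of a query string.
final class ScriptGuesser {

    // Fraction of the letters in text that lie in the Arabic block U+0600..U+06FF.
    // Returns 0.0 when the text contains no letters at all.
    static double arabicBlockRatio(String text) {
        long letters = text.codePoints().filter(Character::isLetter).count();
        if (letters == 0) {
            return 0.0;
        }
        long arabic = text.codePoints()
                .filter(Character::isLetter)
                .filter(cp -> cp >= 0x0600 && cp <= 0x06FF)
                .count();
        return (double) arabic / letters;
    }

    // Most frequent Unicode script among the letters of text.
    // Returns COMMON when the text contains no letters.
    static Character.UnicodeScript dominantScript(String text) {
        Map<Character.UnicodeScript, Integer> counts =
                new EnumMap<>(Character.UnicodeScript.class);
        text.codePoints()
                .filter(Character::isLetter)
                .forEach(cp -> counts.merge(Character.UnicodeScript.of(cp), 1, Integer::sum));

        Character.UnicodeScript best = Character.UnicodeScript.COMMON;
        int bestCount = 0;
        for (Map.Entry<Character.UnicodeScript, Integer> e : counts.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }
}
```

The guessed script could then be used to pick an Analyzer, falling back to asking the user whenever the guess is ambiguous (any Latin-script query, for instance), as Andy suggests.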