From: "OBender Hotmail"
Reply-To: java-user@lucene.apache.org
Subject: RE: Lucene and multi-lingual Unicode - advice needed
Date: Tue, 16 Jun 2009 07:44:30 -0400

Yes, thanks! I'll start with a simple one as you described and test on the languages we have at the moment.

-----Original Message-----
From: Robert Muir [mailto:rcmuir@gmail.com]
Sent: Monday, June 15, 2009 10:45 PM
To: java-user@lucene.apache.org
Subject: Re: Lucene and multi-lingual Unicode - advice needed

ok, well at first i thought you must be playing a joke on me or something...

Maybe you want to create a Lucene analyzer that mimics Solr defaults.
Search the mail archives for this recent thread, where KK posted his code:
Re: How to support stemming and case folding for english content mixed
with non-english content?

Then again, maybe the sample code i gave you (whitespace + lowercase)
is good enough. By the time the company in question manages to get
its Chamorro, Cornish, Blackfoot, and Pashto testers together to
evaluate the search you will be retired :)

On Mon, Jun 15, 2009 at 10:30 PM, OBender Hotmail wrote:
> That's the thing: there is no actual requirement.
> I've been presented with all the languages that the company theoretically provides.
> My guess is that what I'm going to end up with is all western languages, a good share of the Arabic family, a complete set of Eastern and Eastern European ones and of course CJK.
>
> -----Original Message-----
> From: Robert Muir [mailto:rcmuir@gmail.com]
> Sent: Monday, June 15, 2009 9:52 PM
> To: java-user@lucene.apache.org
> Subject: Re: Lucene and multi-lingual Unicode - advice needed
>
> Really, you have a requirement that the system should search written Cornish?
>
> I think you might have larger problems!
>
> On Mon, Jun 15, 2009 at 9:18 PM, OBender Hotmail wrote:
>> Here is the list of possible languages. Don't laugh :) I know those are almost all world languages but it is a true requirement. Well, the actual number will be closer to 70, not 100, but I still don't really know which ones from the list below will end up in the DB.
>>
>> -------
>> Afrikaans Albanian Arabic Armenian Austrian Aymara Azerbaijani
>> Basque Belorussian Bemba Bengali Blackfoot Bosnian Breton Bulgarian Canadian French Catalan Cebuano Chamorro Chinese Chechen Cornish Croatian Czech
>> Danish Dutch
>> Ecuadorian Quechua English English-Portuguese Esperanto Estonian
>> Faroese Farsi Finnish Flemish French Frisian
>> Galician Georgian German Greek Guarani
>> Haitian Creole Hausa Hawaiian Hebrew Hindi Hungarian Icelandic Indonesian Inuktitut Irish Italian
>> Japanese
>> Kazakh Kongo Korean
>> Latin Latvian Lithuanian Luganda Luxembourgish
>> Macedonian Malagasy Malay Maori Maya Mohawk Mongolian
>> Nahuatl Norwegian
>> Papago Pashto Pidgin English Polish Portuguese (European)
>> Portuguese (Brazilian) Provencal
>> Quechua
>> Romanian Romansch Romany Ruanda Russian
>> Samoan Scottish Sepedi Serbian Shona Sicilian Slovak Slovene Somali Sorbian Sotho Spanish Swahili Swazi Swedish
>> Tagalog Tahitian Thai Tongan Tswana Turkish Turkmen Tuvan
>> Ukrainian Urdu Uzbek Vietnamese
>> Welsh Wolof
>> Xhosa
>> Yiddish Yoruba
>> Zulu
>>
>> -----Original Message-----
>> From: Robert Muir [mailto:rcmuir@gmail.com]
>> Sent: Monday, June 15, 2009 5:56 PM
>> To: java-user@lucene.apache.org
>> Subject: Re: Lucene and multi-lingual Unicode - advice needed
>>
>> it's not too bad, here would be a simple one that only breaks words on
>> whitespace and lowercases:
>>
>> public class Example extends Analyzer {
>>   public TokenStream tokenStream(String fieldName, Reader reader) {
>>     TokenStream ts = new WhitespaceTokenizer(reader);
>>     ts = new LowerCaseFilter(ts);
>>     return ts;
>>   }
>> }
>>
>> can you give a better idea as to what languages you have and what your
>> search requirements are (accent marks, punctuation, etc etc) ?
>>
>> On Mon, Jun 15, 2009 at 5:39 PM, OBender Hotmail wrote:
>>> I've looked over Solr quickly; it is a bit too heavy for my project.
>>> So what is required (at a minimum) to build an analyzer? The sandbox has a few of them, varying in complexity.
>>>
>>> -----Original Message-----
>>> From: Robert Muir [mailto:rcmuir@gmail.com]
>>> Sent: Monday, June 15, 2009 4:51 PM
>>> To: java-user@lucene.apache.org
>>> Subject: Re: Lucene and multi-lingual Unicode - advice needed
>>>
>>> Well just reply back if Solr is inappropriate for your needs.
>>>
>>> In that case, you will need to build a custom analyzer (it's not too
>>> bad), so that you can use Compass.
>>>
>>> On Mon, Jun 15, 2009 at 4:19 PM, OBender Hotmail wrote:
>>>> Hi,
>>>>
>>>> My goal is to find a framework that encapsulates as much low level indexing/search technology as possible and have it integrate nicely with Spring.
>>>> It looked like Compass was/is a good encapsulation of the functionality. I'll take a look at Solr though, thanks for the pointer.
>>>>
>>>> -----Original Message-----
>>>> From: Robert Muir [mailto:rcmuir@gmail.com]
>>>> Sent: Monday, June 15, 2009 1:14 PM
>>>> To: java-user@lucene.apache.org
>>>> Subject: Re: Lucene and multi-lingual Unicode - advice needed
>>>>
>>>> Hi,
>>>>
>>>> (Since this is an issue you brought up on the Compass forums)
>>>>
>>>> I wonder what stage you are in the development process?
>>>> Have you considered Solr, or does Compass provide some other
>>>> functionality that you need?
>>>>
>>>> The reason I say this is because the easiest solution might be to use
>>>> a nightly Solr for your application.
>>>>
>>>> I'm not personally biased one way or the other for any particular
>>>> framework, but recently there have been some improvements added to Solr
>>>> so that the default type 'text' is pretty good for multilingual
>>>> processing.
>>>>
>>>> In fact I hope in the future it will be improved in Lucene so that
>>>> your decision is really based upon other application needs...
>>>>
>>>> On Mon, Jun 15, 2009 at 1:10 PM, OBender Hotmail wrote:
>>>>> Hi All!
>>>>>
>>>>> I'm new to Lucene so forgive me if this question was asked before.
>>>>>
>>>>> I have a database with records in the same table in many different languages
>>>>> (up to 70). It includes all W-European, Arabic, Eastern, CJK, Cyrillic, etc.,
>>>>> you name it.
>>>>> I've looked at what people say about Lucene and it looks like for the most
>>>>> part the standard analyzers should do fine with most Unicode languages, but there
>>>>> are quite a few exceptions.
>>>>> Here is a recently updated Lucene Jira thread:
>>>>> https://issues.apache.org/jira/browse/LUCENE-1488
>>>>>
>>>>> My question is, what would be the safest bet for me in terms of
>>>>> analyzers/tokenizers?
>>>>> Do I really have to write my own ones for the bunch of languages that are
>>>>> not supported?
>>>>> Did anyone already solve a problem similar to mine? I'm sure someone
>>>>> already did :)
>>>>>
>>>>> And yes, I looked at the Lucene sandbox analyzers. It just adds more
>>>>> confusion. For example, why are there analyzers for DE and FR? Wouldn't the
>>>>> standard analyzer (which is Unicode compliant, as I understood) deal with EU
>>>>> languages just fine?
>>>>>
>>>>> Thanks in advance for advice :)
>>>>
>>>> --
>>>> Robert Muir
>>>> rcmuir@gmail.com
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
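
A note on the "whitespace + lowercase" approach suggested in the thread: the sketch below is not Lucene code and is not taken from any poster. It is a plain-Java approximation, under the assumption that splitting on `Character.isWhitespace` and lowercasing each code point is close enough to illustrate the idea. The class name `WhitespaceLowercaseSketch` and the method `tokenize` are hypothetical names chosen for this illustration. Sample strings are written with Unicode escapes so the file compiles regardless of source encoding.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone illustration; it does not use or reproduce Lucene classes.
public class WhitespaceLowercaseSketch {

    // Breaks text into tokens at whitespace and lowercases every code point.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isWhitespace(cp)) {
                // Whitespace ends the current token, if there is one.
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.appendCodePoint(Character.toLowerCase(cp));
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // German sample containing u-umlaut and sharp s, plus an all-caps word.
        System.out.println(tokenize("Gr\u00fc\u00dfe aus BERLIN"));
        // Russian sample: two Cyrillic words, the second in capitals.
        System.out.println(tokenize("\u041f\u0440\u0438\u0432\u0435\u0442 \u041c\u0418\u0420"));
        // Japanese sample: four characters with no whitespace between them.
        System.out.println(tokenize("\u6771\u4eac\u90fd\u306b"));
    }
}
```

The Japanese sample comes back as a single token because it contains no whitespace, which is the kind of limitation to check for when testing this approach on CJK text; space-delimited scripts such as Latin and Cyrillic split as expected.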