lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "OBender Hotmail" <osya_ben...@hotmail.com>
Subject RE: Lucene and multi-lingual Unicode - advice needed
Date Tue, 16 Jun 2009 01:18:37 GMT
Here is the list of possible languages. Don't laugh :) I know those are almost all world languages
but it is a true requirement. Well, actual number will be closer to 70 not 100 but still I
don't really know which ones from the list below will end up in the DB.

-------
Afrikaans  Albanian Arabic Armenian Austrian Aymara Azerbaijani
Basque Belorussian Bemba Bengali Blackfoot Bosnian Breton Bulgarian     Canadian French  Catalan
Cebuano Chamorro Chinese Chechen Cornish Croatian Czech
Danish Dutch
Ecuadorian Quechua  English  English-Portuguese Esperanto Estonian
Faroese Farsi Finnish Flemish French Frisian 
Galician Georgian German Greek Guarani
Haitian Creole  Hausa Hawaiian Hebrew Hindi Hungarian Icelandic Indonesian     Inuktitut Irish
Italian
Japanese
Kazakh Kongo Korean
Latin Latvian Lithuanian Luganda Luxembourgish
Macedonian Malagasy Malay Maori Maya Mohawk Mongolian
Nahuatl Norwegian
Papago Pashto Pidgin English Polish Portuguese (European) 
Portuguese (Brazilian) Proven├žal
Quechua
Romanian Romansch Romany Ruanda Russian
Samoan Scottish Sepedi Serbian Shona Sicilian Slovak Slovene Somali    Sorbian Sotho Spanish
Swahili Swazi Swedish
Tagalog Tahitian Thai Tongan Tswana Turkish Turkmen Tuvan
Ukrainian Urdu Uzbek Vietnamese
Welsh Wolof
Xhosa
Yiddish Yoruba
Zulu

-----Original Message-----
From: Robert Muir [mailto:rcmuir@gmail.com] 
Sent: Monday, June 15, 2009 5:56 PM
To: java-user@lucene.apache.org
Subject: Re: Lucene and multi-lingual Unicode - advice needed

its not too bad, here would be a simple one that only breaks words on
whitespace and lowercases:

public class Example extends Analyzer {
 public TokenStream tokenStream(String fieldName, Reader reader) {
   TokenStream ts = new WhitespaceTokenizer(reader);
   ts = new LowerCaseFilter(ts);
   return ts;
 }
}

can you give a better idea as to what languages you have and what your
search requirements are (accent marks, punctuation, etc etc) ?

On Mon, Jun 15, 2009 at 5:39 PM, OBender Hotmail<osya_bender@hotmail.com> wrote:
> I've looked over SolR quickly, it is a bit too heavy for my project.
> So what is required (at a minimum) to build an analyzer, sandbox has a few of them varying
in complexity.
>
> -----Original Message-----
> From: Robert Muir [mailto:rcmuir@gmail.com]
> Sent: Monday, June 15, 2009 4:51 PM
> To: java-user@lucene.apache.org
> Subject: Re: Lucene and multi-lingual Unicode - advice needed
>
> Well just reply back if SolR is inappropriate for your needs.
>
> In that case, you will need to build a custom analyzer (its not too
> bad), so that you can use compass.
>
> On Mon, Jun 15, 2009 at 4:19 PM, OBender Hotmail<osya_bender@hotmail.com> wrote:
>> Hi,
>>
>> My goal is to find a framework that encapsulates as much low level indexing/search
technology as possible and have it integrate nicely with Spring.
>> It looked like Compass was/is a good encapsulation of the functionality. I'll take
a look at SolR though, thanks for the pointer.
>>
>> -----Original Message-----
>> From: Robert Muir [mailto:rcmuir@gmail.com]
>> Sent: Monday, June 15, 2009 1:14 PM
>> To: java-user@lucene.apache.org
>> Subject: Re: Lucene and multi-lingual Unicode - advice needed
>>
>> Hi,
>>
>> (Since this is an issue you brought up on the Compass forums)
>>
>> I wonder what stage you are in the development process?
>> Have you considered SolR, or does compass provide some other
>> functionality that you need?
>>
>> The reason I say this, is because the easiest solution might be to use
>> a nightly SolR for your application.
>>
>> I'm not personally biased one way or the other for any particular
>> framework, but recently there has been some improvements added to SolR
>> so that the default type 'text' is pretty good for multilingual
>> processing.
>>
>> In fact I hope in the future it will be improved in lucene so that
>> your decision is really based upon other application needs...
>>
>> On Mon, Jun 15, 2009 at 1:10 PM, OBender Hotmail<osya_bender@hotmail.com> wrote:
>>> Hi All!
>>>
>>>
>>>
>>> I'm new to Lucene so forgive me if this question was asked before.
>>>
>>> I have a database with records in the same table in many different languages
>>> (up to 70) it includes all W-European, Arabic, Eastern, CJK, Cyrillic, etc.
>>> you name it.
>>> I've looked at what people say about Lucene and it looks like for the most
>>> part standard analyzers should do fine with most Unicode languages but there
>>> are quite a few exceptions.
>>> Here is some recently updated Lucene Jira thread:
>>> https://issues.apache.org/jira/browse/LUCENE-1488
>>>
>>> My question is, what would be the safest bet for me in terms of
>>> analyzers/tokenizers?
>>> Do I really have to write my own ones for the bunch of languages that are
>>> not supported?
>>> Did anyone already solve the problem similar to mine? I'm sure someone
>>> already did :)
>>>
>>> And yes, I looked at the Lucene sandbox analyzers. It just adds more
>>> confusion. For example why there analyzers for DE and FR? Wouldn't the
>>> standard analyzer (which is Unicode complaint as I understood) deal with EU
>>> languages just fine?
>>>
>>> Thanks in advance for advices :)
>>>
>>>
>>
>>
>>
>> --
>> Robert Muir
>> rcmuir@gmail.com
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>
>
>
> --
> Robert Muir
> rcmuir@gmail.com
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>



-- 
Robert Muir
rcmuir@gmail.com

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


No virus found in this incoming message.
Checked by AVG - www.avg.com 
Version: 8.5.339 / Virus Database: 270.12.62/2168 - Release Date: 06/15/09 17:54:00


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message