lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "OBender Hotmail" <osya_ben...@hotmail.com>
Subject RE: Lucene and multi-lingual Unicode - advice needed
Date Tue, 16 Jun 2009 11:44:30 GMT
Yes, thanks! I'll start with a simple one as you described and test on the languages we have
at the moment.

-----Original Message-----
From: Robert Muir [mailto:rcmuir@gmail.com] 
Sent: Monday, June 15, 2009 10:45 PM
To: java-user@lucene.apache.org
Subject: Re: Lucene and multi-lingual Unicode - advice needed

ok, well at first i thought you must be playing a joke on me or something...

Maybe you want to create a lucene analyzer that mimic's solr defaults.

Search the mail archives for this recent thread, and KK posted his code:
Re: How to support stemming and case folding for english content mixed
with non-english content?

Then again, maybe the sample code i gave you (whitespace + lowercase)
is good enough. By the time the company in question manages to get its
Chamorro, Cornish, Blackfoot, and Pashto testers together to evaluate
the search you will be retired :)


On Mon, Jun 15, 2009 at 10:30 PM, OBender
Hotmail<osya_bender@hotmail.com> wrote:
> That's the thing there is no actual requirement.
> I've been presented with all the languages that company theoretically provides.
> My guess is that what I'm going to end up with is all western languages, good share of
Arabic family, complete set of Eastern and Eastern European ones and of course CJK.
>
> -----Original Message-----
> From: Robert Muir [mailto:rcmuir@gmail.com]
> Sent: Monday, June 15, 2009 9:52 PM
> To: java-user@lucene.apache.org
> Subject: Re: Lucene and multi-lingual Unicode - advice needed
>
> Really, you have a requirement that the system should search written Cornish?
>
> I think you might have larger problems!
>
> On Mon, Jun 15, 2009 at 9:18 PM, OBender Hotmail<osya_bender@hotmail.com> wrote:
>> Here is the list of possible languages. Don't laugh :) I know those are almost all
world languages but it is a true requirement. Well, actual number will be closer to 70 not
100 but still I don't really know which ones from the list below will end up in the DB.
>>
>> -------
>> Afrikaans  Albanian Arabic Armenian Austrian Aymara Azerbaijani
>> Basque Belorussian Bemba Bengali Blackfoot Bosnian Breton Bulgarian     Canadian
French  Catalan Cebuano Chamorro Chinese Chechen Cornish Croatian Czech
>> Danish Dutch
>> Ecuadorian Quechua  English  English-Portuguese Esperanto Estonian
>> Faroese Farsi Finnish Flemish French Frisian
>> Galician Georgian German Greek Guarani
>> Haitian Creole  Hausa Hawaiian Hebrew Hindi Hungarian Icelandic Indonesian     Inuktitut
Irish Italian
>> Japanese
>> Kazakh Kongo Korean
>> Latin Latvian Lithuanian Luganda Luxembourgish
>> Macedonian Malagasy Malay Maori Maya Mohawk Mongolian
>> Nahuatl Norwegian
>> Papago Pashto Pidgin English Polish Portuguese (European)
>> Portuguese (Brazilian) Provencal
>> Quechua
>> Romanian Romansch Romany Ruanda Russian
>> Samoan Scottish Sepedi Serbian Shona Sicilian Slovak Slovene Somali    Sorbian Sotho
Spanish Swahili Swazi Swedish
>> Tagalog Tahitian Thai Tongan Tswana Turkish Turkmen Tuvan
>> Ukrainian Urdu Uzbek Vietnamese
>> Welsh Wolof
>> Xhosa
>> Yiddish Yoruba
>> Zulu
>>
>> -----Original Message-----
>> From: Robert Muir [mailto:rcmuir@gmail.com]
>> Sent: Monday, June 15, 2009 5:56 PM
>> To: java-user@lucene.apache.org
>> Subject: Re: Lucene and multi-lingual Unicode - advice needed
>>
>> its not too bad, here would be a simple one that only breaks words on
>> whitespace and lowercases:
>>
>> public class Example extends Analyzer {
>>  public TokenStream tokenStream(String fieldName, Reader reader) {
>>   TokenStream ts = new WhitespaceTokenizer(reader);
>>   ts = new LowerCaseFilter(ts);
>>   return ts;
>>  }
>> }
>>
>> can you give a better idea as to what languages you have and what your
>> search requirements are (accent marks, punctuation, etc etc) ?
>>
>> On Mon, Jun 15, 2009 at 5:39 PM, OBender Hotmail<osya_bender@hotmail.com> wrote:
>>> I've looked over SolR quickly, it is a bit too heavy for my project.
>>> So what is required (at a minimum) to build an analyzer, sandbox has a few of
them varying in complexity.
>>>
>>> -----Original Message-----
>>> From: Robert Muir [mailto:rcmuir@gmail.com]
>>> Sent: Monday, June 15, 2009 4:51 PM
>>> To: java-user@lucene.apache.org
>>> Subject: Re: Lucene and multi-lingual Unicode - advice needed
>>>
>>> Well just reply back if SolR is inappropriate for your needs.
>>>
>>> In that case, you will need to build a custom analyzer (its not too
>>> bad), so that you can use compass.
>>>
>>> On Mon, Jun 15, 2009 at 4:19 PM, OBender Hotmail<osya_bender@hotmail.com>
wrote:
>>>> Hi,
>>>>
>>>> My goal is to find a framework that encapsulates as much low level indexing/search
technology as possible and have it integrate nicely with Spring.
>>>> It looked like Compass was/is a good encapsulation of the functionality.
I'll take a look at SolR though, thanks for the pointer.
>>>>
>>>> -----Original Message-----
>>>> From: Robert Muir [mailto:rcmuir@gmail.com]
>>>> Sent: Monday, June 15, 2009 1:14 PM
>>>> To: java-user@lucene.apache.org
>>>> Subject: Re: Lucene and multi-lingual Unicode - advice needed
>>>>
>>>> Hi,
>>>>
>>>> (Since this is an issue you brought up on the Compass forums)
>>>>
>>>> I wonder what stage you are in the development process?
>>>> Have you considered SolR, or does compass provide some other
>>>> functionality that you need?
>>>>
>>>> The reason I say this, is because the easiest solution might be to use
>>>> a nightly SolR for your application.
>>>>
>>>> I'm not personally biased one way or the other for any particular
>>>> framework, but recently there has been some improvements added to SolR
>>>> so that the default type 'text' is pretty good for multilingual
>>>> processing.
>>>>
>>>> In fact I hope in the future it will be improved in lucene so that
>>>> your decision is really based upon other application needs...
>>>>
>>>> On Mon, Jun 15, 2009 at 1:10 PM, OBender Hotmail<osya_bender@hotmail.com>
wrote:
>>>>> Hi All!
>>>>>
>>>>>
>>>>>
>>>>> I'm new to Lucene so forgive me if this question was asked before.
>>>>>
>>>>> I have a database with records in the same table in many different languages
>>>>> (up to 70) it includes all W-European, Arabic, Eastern, CJK, Cyrillic,
etc.
>>>>> you name it.
>>>>> I've looked at what people say about Lucene and it looks like for the
most
>>>>> part standard analyzers should do fine with most Unicode languages but
there
>>>>> are quite a few exceptions.
>>>>> Here is some recently updated Lucene Jira thread:
>>>>> https://issues.apache.org/jira/browse/LUCENE-1488
>>>>>
>>>>> My question is, what would be the safest bet for me in terms of
>>>>> analyzers/tokenizers?
>>>>> Do I really have to write my own ones for the bunch of languages that
are
>>>>> not supported?
>>>>> Did anyone already solve the problem similar to mine? I'm sure someone
>>>>> already did :)
>>>>>
>>>>> And yes, I looked at the Lucene sandbox analyzers. It just adds more
>>>>> confusion. For example why there analyzers for DE and FR? Wouldn't the
>>>>> standard analyzer (which is Unicode complaint as I understood) deal with
EU
>>>>> languages just fine?
>>>>>
>>>>> Thanks in advance for advices :)
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Robert Muir
>>>> rcmuir@gmail.com
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>
>>>>
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> Robert Muir
>>> rcmuir@gmail.com
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>
>>>
>>
>>
>>
>> --
>> Robert Muir
>> rcmuir@gmail.com
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>> No virus found in this incoming message.
>> Checked by AVG - www.avg.com
>> Version: 8.5.339 / Virus Database: 270.12.62/2168 - Release Date: 06/15/09 17:54:00
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>
>
>
> --
> Robert Muir
> rcmuir@gmail.com
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>



-- 
Robert Muir
rcmuir@gmail.com

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


No virus found in this incoming message.
Checked by AVG - www.avg.com 
Version: 8.5.339 / Virus Database: 270.12.62/2168 - Release Date: 06/15/09 17:54:00


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message