lucene-general mailing list archives

From Chris Collins <chris_j_coll...@yahoo.com>
Subject Re: what if my database data contains other language (like danish, german).
Date Mon, 18 May 2009 02:17:28 GMT
Thanks Otis, I will take a look.

Best

C
On May 17, 2009, at 7:05 PM, Otis Gospodnetic wrote:

>
> Chris,
>
> I don't have the issue number here, but look in Lucene's JIRA and  
> search for... ah, here:
>
>  https://issues.apache.org/jira/browse/LUCENE-1166
>
>
> And for Chinese:
>
>  https://issues.apache.org/jira/browse/LUCENE-1629
>
> If you happen to be using Solr:
>
>  http://www.sematext.com/product-multilingual-analyzer.html
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
>
>
> ----- Original Message ----
>> From: Chris Collins <chris_j_collins@yahoo.com>
>> To: general@lucene.apache.org
>> Sent: Monday, May 11, 2009 11:28:06 AM
>> Subject: Re: what if my database data contains other language (like danish, german).
>>
>> Is anyone aware of either of these two things:
>>
>> 1) The ability to plug in an external source for DF (document frequency).
>> This would let you circumvent the problem you mentioned below.  (Of course,
>> you would have to compute a DF set for each language you want meaningful
>> weights for.)
>> 2) Any open source segmenters, primarily for German, but also for CJK as a
>> long shot :-}
>>
>> Thanks
>>
>> C
>>
>> On May 11, 2009, at 8:13 AM, Ted Dunning wrote:
>>
>>> Yes.  Lucene can handle that.  You have to select which stemmer to use.  You
>>> may have to improve the German and Danish stemmers a little bit.
>>>
>>> You may also run into a weighting issue: if Danish is 5% of your corpus,
>>> then words that occur in 100% of your Danish documents will tend to get
>>> weights that are too high, since they occur in only 5% of your documents
>>> overall.  Any term that occurs in more than 20% of a sub-corpus should
>>> generally be discarded from your query.  This can be difficult in
>>> multi-lingual situations.
>>>
>>> For a first pass, I would ignore this issue, however.
>>>
>>> On Mon, May 11, 2009 at 4:07 AM, uday kumar maddigatla wrote:
>>>
>>>> What if my database data contains other languages (like Danish, German)?
>>>>
>>>> Will Lucene handle that?
>>>>
>>>> If yes, how?
>>>>
>>>
>>>
>>>
>>> --Ted Dunning, CTO
>>> DeepDyve
>>>
>>> 111 West Evelyn Ave. Ste. 202
>>> Sunnyvale, CA 94086
>>> www.deepdyve.com
>>> 858-414-0013 (m)
>>> 408-773-0220 (fax)
>
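
The two ideas in the thread (a separate DF table per language, and dropping query terms that occur in more than 20% of a sub-corpus) can be sketched outside of Lucene as follows. This is only an illustration in plain Python, not Lucene's API: the function names, the `(language, tokens)` input shape, and the smoothed IDF formula are all assumptions made for the example, and it presumes each document already carries a language label.

```python
import math
from collections import Counter, defaultdict


def build_language_df(docs):
    """Build per-language document frequencies.

    docs: iterable of (language, tokens) pairs (hypothetical input shape).
    Returns (df, n_docs): df[lang][term] is the number of documents in that
    language containing the term; n_docs[lang] is the document count.
    """
    df = defaultdict(Counter)
    n_docs = Counter()
    for lang, tokens in docs:
        n_docs[lang] += 1
        df[lang].update(set(tokens))  # count each term once per document
    return df, n_docs


def filter_query(terms, lang, df, n_docs, max_fraction=0.20):
    """Drop query terms found in more than max_fraction of the sub-corpus."""
    n = n_docs[lang]
    if n == 0:
        return list(terms)  # no statistics for this language; keep everything
    return [t for t in terms if df[lang][t] / n <= max_fraction]


def idf(term, lang, df, n_docs):
    """Smoothed IDF measured against the language sub-corpus only.

    The formula is an assumed, conventional smoothing, chosen for illustration.
    """
    return 1.0 + math.log(n_docs[lang] / (df[lang][term] + 1))
```

With 10 Danish documents among 100, a word such as "og" that appears in every Danish document has a corpus-wide frequency of only 10%, so a global DF table would treat it as fairly rare. Against the Danish-only table it sits at 100%, so `filter_query` drops it and `idf` ranks it well below a genuinely rare Danish term.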

