lucene-general mailing list archives

From Chris Collins <chris_j_coll...@yahoo.com>
Subject Re: what if my database data contains other language (like danish, german).
Date Mon, 11 May 2009 15:28:06 GMT
Is anyone aware of either of the following:

1) The ability to plug in an external source for DF (document
frequency). This would let you circumvent the problem you mention
below. (Of course, you would have to compute a DF set for each
language for which you want meaningful weights.)
2) Any open source segmenters, primarily for German, but also for CJK
as a long shot :-}
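
To make point 1 concrete, here is a minimal sketch of what I mean by an external DF source. It is plain Python, not Lucene API, and everything in it is an assumption for illustration: the function names (build_df_tables, idf, filter_query), the plain log(N/df) weighting, and the toy corpus. It keeps one DF table per language, computes IDF against the language sub-corpus rather than the whole index, and applies the 20% cutoff mentioned below.

```python
import math
from collections import Counter, defaultdict

def build_df_tables(docs):
    """docs: iterable of (language, tokens). Returns per-language DF and doc counts."""
    df = defaultdict(Counter)
    n_docs = Counter()
    for lang, tokens in docs:
        n_docs[lang] += 1
        df[lang].update(set(tokens))  # count each term at most once per document
    return df, n_docs

def idf(term, lang, df, n_docs):
    """Plain log(N/df) against the language sub-corpus (assumed weighting, not Lucene's)."""
    n = n_docs[lang]
    if n == 0:
        return 0.0
    d = df[lang][term]
    return math.log(n / max(d, 1))  # unseen terms are treated as df = 1

def filter_query(terms, lang, df, n_docs, max_fraction=0.20):
    """Drop query terms that occur in more than max_fraction of the sub-corpus."""
    n = n_docs[lang]
    if n == 0:
        return list(terms)
    return [t for t in terms if df[lang][t] / n <= max_fraction]

# Toy corpus (hypothetical): Danish is 5% of the documents.
corpus = (
    [("en", ["the", "horse"])] * 190
    + [("da", ["og", "hest"])]
    + [("da", ["og", "hus"])] * 9
)
df, n_docs = build_df_tables(corpus)
```

With this toy corpus, "og" occurs in every Danish document, so its IDF against the Danish sub-corpus is log(10/10) = 0, whereas against the pooled 200 documents it would be log(200/10), roughly 3.0. That gap is the over-weighting described below, and filter_query(["og", "hest", "hus"], "da", df, n_docs) keeps only "hest".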

Thanks

C

On May 11, 2009, at 8:13 AM, Ted Dunning wrote:

> Yes.  Lucene can handle that.  You have to select which stemmer to
> use.  You may have to improve the German and Danish stemmers a little
> bit.
>
> You may also have some issues with the fact that if Danish is 5% of
> your corpus, then words that occur in 100% of your Danish documents
> will tend to have too high weights since they only occur in 5% of
> your documents.  Any term that occurs in more than 20% of a
> sub-corpus should generally be discarded from your query.  This can
> be difficult in multi-lingual situations.
>
> For a first pass, I would ignore this issue, however.
>
> On Mon, May 11, 2009 at 4:07 AM, uday kumar maddigatla
> <ukma@mach.com> wrote:
>
>> What if my database data contains other languages (like Danish,
>> German)?
>>
>> Will Lucene handle that?
>>
>> If yes, how?
>>
>
>
>
> -- 
> Ted Dunning, CTO
> DeepDyve
>
> 111 West Evelyn Ave. Ste. 202
> Sunnyvale, CA 94086
> www.deepdyve.com
> 858-414-0013 (m)
> 408-773-0220 (fax)

