lucene-general mailing list archives

From Otis Gospodnetic <otis_gospodne...@yahoo.com>
Subject Re: what if my database data contains other language (like danish, german).
Date Mon, 18 May 2009 02:05:10 GMT

Chris,

For German, I don't have the issue number here, but look in Lucene's JIRA and search for... ah, here:

  https://issues.apache.org/jira/browse/LUCENE-1166


And for Chinese:

  https://issues.apache.org/jira/browse/LUCENE-1629

If you happen to be using Solr:

  http://www.sematext.com/product-multilingual-analyzer.html

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



----- Original Message ----
> From: Chris Collins <chris_j_collins@yahoo.com>
> To: general@lucene.apache.org
> Sent: Monday, May 11, 2009 11:28:06 AM
> Subject: Re: what if my database data contains other language (like danish, german).
> 
> Is anyone aware of either of these two things:
> 
> 1) The ability to plug in an external source for DF (document frequency).
> This would allow you to circumvent the problem you mentioned below.  (Of
> course, you would have to compute a DF set for each language you want
> meaningful weights for.)
> 2) Any open source segmenters, primarily for German, but also for CJK as a 
> long shot :-}
> 
> Thanks
> 
> C
> 
> On May 11, 2009, at 8:13 AM, Ted Dunning wrote:
> 
> > Yes.  Lucene can handle that.  You have to select which stemmer to use.  You
> > may have to improve the German and Danish stemmers a little bit.
> > 
> > You may also have some issues with term weighting.  If Danish is 5% of your
> > corpus, then words that occur in 100% of your Danish documents will tend to
> > be weighted too heavily, since they occur in only 5% of your documents
> > overall and so look rare.  Any term that occurs in more than 20% of a
> > sub-corpus should generally be discarded from your query.  This can be
> > difficult in multilingual situations.
> > 
> > For a first pass, I would ignore this issue, however.
> > 
> > On Mon, May 11, 2009 at 4:07 AM, uday kumar maddigatla wrote:
> > 
> >> What if my database contains data in other languages (such as Danish or
> >> German)?
> >> 
> >> Will Lucene handle that?
> >> 
> >> If yes, how?
> >> 
> > 
> > 
> > 
> > --Ted Dunning, CTO
> > DeepDyve
> > 
> > 111 West Evelyn Ave. Ste. 202
> > Sunnyvale, CA 94086
> > www.deepdyve.com
> > 858-414-0013 (m)
> > 408-773-0220 (fax)
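
The weighting problem Ted describes, and the per-language DF set Chris asks about, can be sketched with a toy corpus. This is only an illustration under stated assumptions, not Lucene code: the corpus, the function names (doc_freq, idf, filter_query) and the smoothed IDF form log(N / (df + 1)) + 1 are all chosen here for the example. The 20% cutoff is the figure from Ted's message.

```python
import math
from collections import Counter


def doc_freq(docs):
    """Count, for each term, how many documents contain it at least once."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    return df


def idf(term, df, n_docs):
    """A common smoothed IDF form (an assumption for this sketch)."""
    return math.log(n_docs / (df[term] + 1)) + 1.0


def filter_query(terms, df, n_docs, max_fraction=0.20):
    """Keep only terms found in at most max_fraction of the given (sub-)corpus."""
    return [t for t in terms if df[t] / n_docs <= max_fraction]


# Hypothetical corpus: 190 English and 10 Danish documents (Danish is 5%).
# "the" appears in every English document, "og" in every Danish document.
english = [["the", f"word{i}"] for i in range(190)]
danish = [["og", f"ord{i}"] for i in range(10)]
all_docs = english + danish

global_df = doc_freq(all_docs)   # one DF table for the whole index
danish_df = doc_freq(danish)     # separate DF table for the Danish sub-corpus

# Against the whole index "og" looks rare, so it gets a high weight.
global_idf_og = idf("og", global_df, len(all_docs))
# Against the Danish sub-corpus it is clearly a stop-word-like term.
danish_idf_og = idf("og", danish_df, len(danish))

# The 20% rule only catches "og" when applied to the per-language DF table.
kept_global = filter_query(["og", "ord3"], global_df, len(all_docs))
kept_danish = filter_query(["og", "ord3"], danish_df, len(danish))

print(f"idf('og') global: {global_idf_og:.2f}, Danish only: {danish_idf_og:.2f}")
print("kept with global DF:", kept_global)
print("kept with Danish DF:", kept_danish)
```

The sketch keeps one DF table per language, which is the extra bookkeeping Chris mentions. It also assumes you know which language a query is in before you pick the table.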

