lucene-general mailing list archives

From Ted Dunning <>
Subject Re: what if my database data contains other language (like danish, german).
Date Mon, 11 May 2009 16:23:07 GMT
On Mon, May 11, 2009 at 8:28 AM, Chris Collins <> wrote:

> Is anyone aware of either of these two things:
> 1) the ability to plug in an external source for DF? This would allow you
> to circumvent the problem you mentioned below.  (Of course, you would have
> to compute a DF set for each language you care to have meaningful weights
> for.)


The typical idiom is to extend Searcher with a specialized implementation
that supplies the term frequencies you want it to use.

This is what Katta does to propagate cluster-global term frequencies to
shard-specific searches.  Presumably Solr does likewise.
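As a toy illustration of that idiom (illustrative names only, and not Lucene's or Katta's actual API; the idf formula shown is the classic Lucene one):

```python
import math

class ExternalDFScorer:
    """Toy sketch: idf is computed from an externally supplied,
    cluster-global document-frequency table rather than the local shard.
    Illustrative only; this is not Lucene's or Katta's real API."""

    def __init__(self, global_num_docs, global_df):
        self.global_num_docs = global_num_docs
        self.global_df = global_df  # term -> document frequency across the cluster

    def idf(self, term):
        # classic Lucene-style idf: 1 + ln(N / (df + 1))
        df = self.global_df.get(term, 0)
        return 1.0 + math.log(self.global_num_docs / (df + 1.0))
```

Every shard scores with the same table, so a term's weight no longer depends on which shard (or which language's slice of the data) happens to hold the document.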

> 2) any open source segmenters, primarily for german, but also for CJK at a
> longshot :-}

Lucene has a rudimentary German stemmer which may be sufficient.  Real lemma
identification in German can be difficult because of the large number of
morphological variants and word compounding.  For text retrieval, however,
compounding is your friend and very simple stemmers typically suffice.
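A caricature of such a very simple stemmer (purely illustrative; Lucene's actual GermanStemmer is more involved):

```python
def light_german_stem(word, min_stem=3):
    """Toy suffix-stripping stemmer in the spirit of a light German
    stemmer.  Purely illustrative; this is NOT Lucene's GermanStemmer."""
    # fold case and normalize umlauts/eszett
    word = (word.lower()
                .replace("ä", "a").replace("ö", "o")
                .replace("ü", "u").replace("ß", "ss"))
    # strip one common inflectional suffix, longest first,
    # but never shrink the stem below min_stem characters
    for suffix in ("ern", "en", "er", "es", "em", "e", "s", "n"):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)]
    return word
```

Crude as it is, this kind of suffix stripping conflates enough inflectional variants (Hunden/Hund, Häuser/Haus) to help recall, which is usually what matters for retrieval.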

For CJK, the approach that I favor lately is this one:

Basically, it is a longest-dictionary-match method, with the addition that it
picks the next token that is part of the longest match over the next three
tokens.  This gets rid of the garden-path problems that greedy algorithms
without look-ahead have.  It depends a bit on the assumption that long words
in the dictionary have higher frequency than would be expected if their
possible components occurred independently.  This means that picking the
longer match in the dictionary is equivalent to doing a more subtle statistical
test.  (See here for more details on the stats involved in bigram detection.)
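A sketch of that scheme (maximum matching with a three-word look-ahead, shown on Latin letters standing in for CJK characters; the function and dictionary here are made up for illustration):

```python
def segment(text, dictionary, lookahead=3):
    """At each position, enumerate all chunks of up to `lookahead`
    consecutive dictionary words and commit only the FIRST word of the
    chunk covering the most text.  Sketch of the idea in the post, not
    any particular library's implementation."""
    max_word = max((len(w) for w in dictionary), default=1)

    def chunks(pos, depth):
        # yield all sequences of up to `depth` dictionary words from pos
        if depth == 0 or pos == len(text):
            yield []
            return
        found = False
        for end in range(pos + 1, min(pos + max_word, len(text)) + 1):
            w = text[pos:end]
            if w in dictionary:
                found = True
                for rest in chunks(end, depth - 1):
                    yield [w] + rest
        if not found:
            # fall back to a single character so segmentation never stalls
            for rest in chunks(pos + 1, depth - 1):
                yield [text[pos]] + rest

    result, pos = [], 0
    while pos < len(text):
        best = max(chunks(pos, lookahead), key=lambda c: sum(len(w) for w in c))
        result.append(best[0])
        pos += len(best[0])
    return result
```

On `"abcdefgh"` with dictionary `{"abcd", "abc", "def", "gh"}`, a greedy segmenter would commit "abcd" and then fall apart into single characters; the look-ahead version sees that "abc" leads to the longer three-word chunk and segments cleanly.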

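The bigram statistic alluded to above is presumably the log-likelihood ratio (G²) test over a 2x2 contingency table of counts; assuming that is the statistic meant, a minimal sketch:

```python
import math

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 contingency table of bigram
    counts: k11 = count(A followed by B), k12 = count(A, then not B),
    k21 = count(not A, then B), k22 = count(neither).  Large values mean
    the pair occurs far more often than independence would predict."""
    def xlogx(x):
        return x * math.log(x) if x > 0 else 0.0

    def entropy(*ks):
        return xlogx(sum(ks)) - sum(xlogx(k) for k in ks)

    row = entropy(k11 + k12, k21 + k22)  # row marginals
    col = entropy(k11 + k21, k12 + k22)  # column marginals
    mat = entropy(k11, k12, k21, k22)    # full table
    return 2.0 * (row + col - mat)
```

When the table is exactly what independence predicts (e.g. all four cells equal), the statistic is zero; a strongly associated pair like `llr(100, 1, 1, 100)` scores high.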
Ted Dunning, CTO
