lucene-java-user mailing list archives

From Bob Carpenter <>
Subject Re: Analysis/tokenization of compound words (German, Chinese, etc.)
Date Tue, 21 Nov 2006 22:29:49 GMT
eks dev wrote:

> Depends what you need to do with it. If you only need this as a "kind of stemming"
> for searching documents, the solution is not all that complex. If you need linguistically
> correct splitting, then it gets complicated.

This is a very good point.  Stemming for
high recall is much easier than fine-grained
linguistic morphology.

Often the best solution is a best guess based
on linguistic rules, statistical models, and
heuristics, combined with weaker substring
matching.
> For better solutions that would cover fuzzy errors, contact Bob Carpenter from Alias-i;
> his spell checker can do this rather easily, unfortunately (for us) for money. (Warning: I
> am in no relation to Bob or Alias-i at all.)

The implementation we have is a simple character-level
noisy channel model.  We even have a tutorial on
how to do this for Chinese:

As pointed out in another thread, this requires a set of
training data consisting of the parts of the German
words.  And you may need to allow things other than
spaces to be dropped in cases of epenthesis (adding
a vowel between words).
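To make the splitting task concrete, here's a rough sketch of a
dictionary-driven compound splitter that tolerates linking elements
between parts (the German "Fugenelemente" case).  The lexicon and
linker list below are illustrative stand-ins, and this is not
LingPipe's actual implementation -- a real noisy channel model would
score candidate splits probabilistically rather than taking the first
match:

```python
# Hypothetical sketch: dictionary-based compound splitting that allows
# an optional linking element (e.g. German "s" or "n") between parts.
# LEXICON and LINKERS are toy examples, not real training data.

LEXICON = {"orange", "saft", "staat", "schulden"}   # known word parts
LINKERS = ("", "s", "es", "n", "en")                # optional linking elements

def split_compound(word, parts=()):
    """Return one valid segmentation of `word` as a list, or None."""
    if word == "":
        return list(parts) if parts else None
    # Try the longest lexicon prefix first.
    for end in range(len(word), 0, -1):
        head = word[:end]
        if head in LEXICON:
            rest = word[end:]
            # Allow an optional linking element after this part.
            for link in LINKERS:
                if rest.startswith(link):
                    result = split_compound(rest[len(link):], parts + (head,))
                    if result:
                        return result
    return None

print(split_compound("staatsschulden"))  # ['staat', 'schulden']
print(split_compound("orangensaft"))     # ['orange', 'saft']
```

The "s" in "staatsschulden" and the "n" in "orangensaft" are exactly
the kind of inserted material the training data has to account for.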

It's also possible to bootstrap directly from
raw data, though only for the stemming for
high recall case -- you won't get close to the
true morphology this way.
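One common way to bootstrap from raw data (again, a general heuristic,
not LingPipe's method): count token frequencies in a corpus, then
prefer the segmentation whose parts have the highest geometric mean
frequency.  The tiny corpus here is a stand-in:

```python
# Hypothetical sketch of bootstrapping a splitter from raw text alone:
# pick the segmentation whose parts have the highest geometric mean
# corpus frequency.  The unsplit word competes as a candidate too.

from collections import Counter
from itertools import combinations

corpus = "der saft die orange der staat die schulden saft orange staat".split()
freq = Counter(corpus)

def best_split(word, min_part=3):
    """Return the best-scoring segmentation of `word` into known parts."""
    n = len(word)
    candidates = [[word]]  # leaving the word unsplit is always an option
    cut_points = range(min_part, n - min_part + 1)
    # Enumerate all segmentations whose parts are at least min_part chars.
    for k in range(1, n // min_part):
        for cuts in combinations(cut_points, k):
            parts = [word[i:j] for i, j in zip((0,) + cuts, cuts + (n,))]
            if all(len(p) >= min_part for p in parts):
                candidates.append(parts)

    def score(parts):
        product = 1
        for p in parts:
            product *= freq[p]          # unseen parts score zero
        return product ** (1 / len(parts))

    return max(candidates, key=score)

print(best_split("orangesaft"))  # ['orange', 'saft']
```

This recovers high-recall splits from frequency alone, but it knows
nothing about real morpheme boundaries -- which is why it falls short
of true morphology.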

Just to clarify, our LingPipe license is a dual
royalty-free/commercial license.  Our source is
downloadable online.  The royalty-free license
is very much like the GPL, with the added restriction that
you have to make public the data over which you run LingPipe.

- Bob Carpenter
