lucene-java-user mailing list archives

From eks dev <>
Subject Re: Analysis/tokenization of compound words
Date Tue, 19 Sep 2006 20:41:33 GMT
I just remembered one minor thing that made our life easier: the recursive loop has a primitive
stripEndings() method that removes most of the variable endings (all these ungs/ungen/...) before
looking up in the SuffixTree. This reduces your dictionary needs dramatically. I think this is
partially done in GermanStemmer in Lucene...
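
A minimal sketch of what such a stripEndings() step could look like, in Python for brevity. The
ending list and the minimum stem length are assumptions for illustration; the post only names
"ungs/ungen/...", and this is not the original code.

```python
# Hypothetical ending list: only "ungs"/"ungen" come from the post, the rest are guesses.
ENDINGS = ("ungen", "ungs", "ung", "en", "er", "es", "e", "n", "s")


def strip_endings(word: str, min_stem: int = 3) -> str:
    """Remove the longest listed ending, keeping at least min_stem characters."""
    lower = word.lower()
    # Try longer endings first so "ungen" wins over "en" or "n".
    for ending in sorted(ENDINGS, key=len, reverse=True):
        if lower.endswith(ending) and len(lower) - len(ending) >= min_stem:
            return lower[: -len(ending)]
    return lower


print(strip_endings("Wohnungen"))  # wohn
print(strip_endings("Wohnungs"))   # wohn
```

Because both forms collapse to the same string, the dictionary only needs one entry for them
instead of one entry per inflection.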

Ah, another one: when you strip a suffix, check whether the last char of the remaining "stem" is
"s" (a magic thing in German), and delete it if it is not the only letter... do not ask why, it is
a long-unexplained mystery of the German language.
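
The "s" rule can be sketched as a small helper (Python, hypothetical function name). The "s" in
question is usually called the linking s (Fugen-s), as in "Arbeitszeit" = "Arbeit" + "s" + "Zeit".
The sketch assumes the suffix has already been stripped from the word.

```python
def drop_linking_s(stem: str) -> str:
    """Drop a trailing "s" from the remaining stem unless it is the only letter."""
    if len(stem) > 1 and stem.endswith("s"):
        return stem[:-1]
    return stem


# "arbeitszeit" with the suffix "zeit" stripped leaves "arbeits".
print(drop_linking_s("arbeits"))  # arbeit
print(drop_linking_s("s"))        # s
```

Note that this blindly removes the "s" of stems that really end in "s" as well, which is part of
the distortion accepted in a "stemming-like" approach.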

This approach works in 99% of cases, and special linguistic tricks are anyhow not so relevant
for most search situations. A regular stemmer causes much greater distortion than this.

I must find this code somewhere; I probably left something out in these emails.

----- Original Message ----
From: eks dev <>
Sent: Tuesday, 19 September, 2006 10:15:04 PM
Subject: Re: Analysis/tokenization of compound words

Hi Otis,
It depends on what you need to do with it. If you only need this as a "kind of stemming"
for searching documents, the solution is not all that complex. If you need linguistically correct
splitting, then it gets complicated.

For the first case:
Build a SuffixTree from your dictionary (hope you have many inflections for German words in
your dictionary: feminine, masculine, plural, n-ending, 4 cases..., e.g. Tänzerin, Tänzer). Find
the longest suffix of the word that is in your dictionary, strip it, and recurse on what remains
of the original word... It is fast.
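
A rough sketch of the longest-suffix splitting described above, in Python. It uses a trie over
reversed words as a stand-in for the SuffixTree, which is an assumption and not the original
implementation; the function names and the tiny dictionary are hypothetical.

```python
END = ""  # end-of-word marker; never collides with a real character key


def build_suffix_trie(words):
    """Index words by reversed spelling so matching can walk from the end of a word."""
    root = {}
    for word in words:
        node = root
        for ch in reversed(word.lower()):
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def longest_suffix(trie, word):
    """Return the longest dictionary word that ends `word`, or None."""
    node, best = trie, None
    for i in range(len(word) - 1, -1, -1):
        node = node.get(word[i])
        if node is None:
            break
        if END in node:
            best = word[i:]
    return best


def split_compound(trie, word):
    """Strip the longest known suffix, then recurse on the remainder."""
    word = word.lower()
    suffix = longest_suffix(trie, word)
    if suffix is None or suffix == word:
        return [word]  # nothing (more) to split off
    return split_compound(trie, word[: -len(suffix)]) + [suffix]


trie = build_suffix_trie(["Ballett", "Tänzerin", "Tänzer"])
print(split_compound(trie, "Balletttänzerin"))  # ['ballett', 'tänzerin']
```

Each lookup walks at most the length of the word, so the cost per split does not depend on the
size of the dictionary.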

If I remember correctly, there is some SuffixTree implementation in Lucene util (not really good
for large dictionaries).

Things to be aware of: your recall will drop if you use the simple fuzzy matching that is
normally found.

- "Balletttänzerin" -> "Ballett" "tänzerin", so if your query does not get split due
to typos, there is no chance to find it, e.g. "Ballettänzerim" -> "Ballettänzerim"

- You need a good dictionary with all inflections (google "morphy" or something like this to
help you generate all forms)

- Try to be careful with short prefixes here, as they lead to totally wrong splitting:
"umbau" -> "um" "bau" (changes the meaning, and if you have the preposition "um" as a stopword...)
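
One possible guard against the short-prefix problem from the list above is a minimum length for
every piece. The sketch below is Python, uses a plain set instead of a SuffixTree to stay short,
and the threshold of 4 is an assumption, not something from the original code.

```python
def split_guarded(word, dictionary, min_len=4):
    """Longest-suffix split that refuses to create pieces shorter than min_len."""
    word = word.lower()
    # Ascending start positions try the longest suffix first;
    # the range bounds keep both head and tail at min_len or longer.
    for start in range(min_len, len(word) - min_len + 1):
        head, tail = word[:start], word[start:]
        if tail in dictionary:
            return split_guarded(head, dictionary, min_len) + [tail]
    return [word]


words = {"um", "bau", "ballett", "tänzerin"}
print(split_guarded("Umbau", words))             # ['umbau']
print(split_guarded("Umbau", words, min_len=2))  # ['um', 'bau']
print(split_guarded("Balletttänzerin", words))   # ['ballett', 'tänzerin']
```

The guard does nothing for the typo case: "Ballettänzerim" still comes back unsplit, because no
dictionary word ends it.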

For better solutions that would cover fuzzy errors, contact Bob Carpenter from Alias-I; his
SpellChecker can do this rather easily, unfortunately (for us) for money (Warning: I have
no relation to Bob or Alias-I at all)...

Daniel Naber did some work with German dictionaries as well; if I recall correctly, maybe he has
something that helps.

Anyhow, if you opt for the first option, I will try to dig something out of our archives;
we did something similar ages ago ("stemming-like" splitting of words in German).

Have fun, e.

----- Original Message ----
From: Otis Gospodnetic <>
Sent: Tuesday, 19 September, 2006 6:21:55 PM
Subject: Analysis/tokenization of compound words


How do people typically analyze/tokenize text with compounds (e.g. German)?  I took a look
at GermanAnalyzer hoping to see how one can deal with that, but it turns out GermanAnalyzer
doesn't treat compounds in any special way at all.

One way to go about this is to have a word dictionary and a tokenizer that processes input
one character at a time, looking for a word match in the dictionary after each processed character.
Then, CompoundWordLikeThis could be broken down into multiple tokens/words and returned as
a set of tokens at the same position.  However, somehow this doesn't strike me as a very smart
and fast approach.
What are some better approaches?
If anyone has implemented anything that deals with this problem, I'd love to hear about it.
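
For reference, the character-by-character scan described above could be sketched like this in
Python. The function name and word lists are hypothetical, and the sketch returns a plain list,
leaving token positions out entirely.

```python
def scan_compound(word, dictionary):
    """Grow a buffer one character at a time; emit it as soon as it is a dictionary word."""
    lower = word.lower()
    tokens, start = [], 0
    for end in range(1, len(lower) + 1):
        if lower[start:end] in dictionary:
            tokens.append(lower[start:end])
            start = end
    if start < len(lower):
        tokens.append(lower[start:])  # unmatched tail is kept as one token
    return tokens


print(scan_compound("CompoundWordLikeThis", {"compound", "word", "like", "this"}))
# ['compound', 'word', 'like', 'this']
print(scan_compound("Balletttänzerin", {"ball", "ballett", "tänzerin"}))
# ['ball', 'etttänzerin']
```

The second call shows one weakness of the naive scan: it commits to the first (shortest) match,
so a short dictionary word such as "ball" hides the intended split.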


To unsubscribe, e-mail:
For additional commands, e-mail:
