lucene-java-user mailing list archives

From eks dev <eks...@yahoo.co.uk>
Subject Re: Analysis/tokenization of compound words
Date Tue, 19 Sep 2006 20:15:04 GMT
Hi Otis,
It depends on what you need to do with it. If you only need this as a "kind of stemming"
for searching documents, the solution is not all that complex. If you need linguistically
correct splitting, then it gets complicated.

For the first case:
Build a SuffixTree from your dictionary (and hope your dictionary has many inflections for
German words: feminine, masculine, plural, n-endings, the four cases... e.g. Tänzerin/Tänzer).
Find the longest suffix of the word that is in your dictionary, strip it, and recurse on the
part of the word that remains... It is fast.

If I remember correctly, there is a SuffixTree implementation somewhere in Lucene's util code
(not really good for large dictionaries).

Things to be aware of: your recall will drop in the kinds of cases that simple fuzzy matching
would normally catch.

- "Balletttänzerin" -> "Ballett" "tänzerin", so if your request does not get split due
to typos no chance to find it, e.g. "Ballettänzerim"->"Ballettänzerim"

- You need a good dictionary with all inflections (Google for "morphy" or something similar
to help you generate all the forms).

- Be careful with short prefixes here, as they lead to totally wrong splits:
"umbau" -> "um" + "bau" (which changes the meaning, and if you have the preposition "um" as a stopword...).

For better solutions that also cover fuzzy errors, contact Bob Carpenter from Alias-I; his
SpellChecker can do this rather easily, unfortunately (for us) for money (Warning: I am in
no relation to Bob or Alias-I at all)...

Daniel Naber has done some work with German dictionaries as well, if I recall correctly;
maybe he has something that helps.

Anyhow, if you opt for the first option, I will try to dig something out of our archives;
we did something similar ages ago ("stemming-like" splitting of German words).

Have fun, e.

----- Original Message ----
From: Otis Gospodnetic <otis_gospodnetic@yahoo.com>
To: java-user@lucene.apache.org
Sent: Tuesday, 19 September, 2006 6:21:55 PM
Subject: Analysis/tokenization of compound words

Hi,

How do people typically analyze/tokenize text with compounds (e.g. German)?  I took a look
at GermanAnalyzer hoping to see how one can deal with that, but it turns out GermanAnalyzer
doesn't treat compounds in any special way at all.

One way to go about this is to have a word dictionary and a tokenizer that processes input
one character at a time, looking for a word match in the dictionary after each processed character.
 Then CompoundWordLikeThis could be broken down into multiple tokens/words and returned as
a set of tokens at the same position.  However, somehow this doesn't strike me as a very smart
and fast approach.
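
As a rough sketch of that idea (the class name and the naive single-split scan are just for
illustration; it assumes the Token API where next() returns a Token):

import java.io.IOException;
import java.util.LinkedList;
import java.util.Set;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

/**
 * Sketch only: if both halves of a token are dictionary words, emit
 * them as extra tokens at the same position as the original.
 */
public class CompoundSplitFilter extends TokenFilter {
    private final Set<String> dictionary;                              // known words, lower-cased
    private final LinkedList<Token> pending = new LinkedList<Token>(); // parts not yet emitted

    public CompoundSplitFilter(TokenStream in, Set<String> dictionary) {
        super(in);
        this.dictionary = dictionary;
    }

    public Token next() throws IOException {
        if (!pending.isEmpty()) {
            Token part = pending.removeFirst();
            part.setPositionIncrement(0); // stack on the compound's position
            return part;
        }
        Token token = input.next();
        if (token == null) return null;

        String text = token.termText().toLowerCase();
        // naive scan: queue the first split where both halves are dictionary words
        for (int i = 1; i < text.length(); i++) {
            if (dictionary.contains(text.substring(0, i))
                    && dictionary.contains(text.substring(i))) {
                int start = token.startOffset();
                pending.add(new Token(text.substring(0, i), start, start + i));
                pending.add(new Token(text.substring(i), start + i, token.endOffset()));
                break;
            }
        }
        return token; // original compound first; its parts follow at the same position
    }
}

Setting positionIncrement to 0 is what puts the parts on the same position as the original compound.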
What are some better approaches?
If anyone has implemented anything that deals with this problem, I'd love to hear about it.

Thanks,
Otis



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
