lucene-java-user mailing list archives

From "Jonathan O'Connor" <jonathan.ocon...@xcom.de>
Subject Re: Analysis/tokenization of compound words
Date Tue, 19 Sep 2006 16:49:02 GMT
Otis,
I can't offer you any practical advice, but as a student of German, I can
tell you that beginners find it difficult to read German words and split
them properly. The larger your vocabulary, the easier it is. The whole topic
sounds like an AI problem.

A possible algorithm for German (no idea if this would also work for
English or agglutinative languages like Turkish) might be:
1. Search for the whole word in the dictionary. If found, stop.
2. Split the word into syllables (this might be another AI project too).
3. Join consecutive syllables together and see if they make words in the
dictionary.
4. If every syllable ends up inside a known word, you have success.
5. A useful heuristic is to make each word as long as possible.

E.g. "Balletttänzerin" ("Balletttaenzerin" if you can't read umlauts).
Syllables: "Ball", "ett", "taenz", "er", "in"
Joining the syllables, we see that "Ball" is in our dictionary, but
"etttaenzerin", "etttaenzer", "etttaenz" and "ett" are not. So on we go:
"Ballett" is in our dictionary, and so is "taenzerin". Note that if we went
for the short words first, we could instead split it into: Ballett | taenzer |
in.
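
The steps above can be sketched in Python as follows. This is only a rough
sketch under stated assumptions: the syllable split (step 2) is supplied by
hand, the dictionary is a tiny made-up set of lowercase words with umlauts
transliterated, and the names split_compound and longest_first are
hypothetical helpers, not part of Lucene.

```python
def split_compound(syllables, dictionary, longest_first=True):
    """Split a word, given as a list of syllables, into dictionary words.

    Returns a list of words that together use every syllable, or None if
    no such split exists. Hypothetical helper, not a Lucene API.
    """
    whole = "".join(syllables)
    if whole in dictionary:  # step 1: the whole word is already known
        return [whole]

    n = len(syllables)

    def solve(start):
        if start == n:  # step 4: every syllable has been used
            return []
        # Step 5: try the longest candidate first (or the shortest, for comparison).
        ends = range(n, start, -1) if longest_first else range(start + 1, n + 1)
        for end in ends:
            candidate = "".join(syllables[start:end])  # step 3: join syllables
            if candidate in dictionary:
                rest = solve(end)
                if rest is not None:
                    return [candidate] + rest
        return None  # dead end, the caller tries its next candidate

    return solve(0)


# Assumed toy data: lowercase, umlauts transliterated, syllables split by hand.
syllables = ["ball", "ett", "taenz", "er", "in"]
dictionary = {"ball", "ballett", "taenzer", "taenzerin", "in"}

print(split_compound(syllables, dictionary))
# ['ballett', 'taenzerin']
print(split_compound(syllables, dictionary, longest_first=False))
# ['ballett', 'taenzer', 'in']
```

With shortest-first, "ball" is tried first and abandoned because nothing in
the dictionary starts with "ett", which is the backtracking step described
in the example.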

As usual, it's an interesting project with no 100% perfect solution. Best of
luck
Jonathan O'Connor
XCOM Dublin


                                                                           
Otis Gospodnetic <otis_gospodnetic@yahoo.com> wrote on 19/09/2006 17:21:
To: java-user@lucene.apache.org
Subject: Analysis/tokenization of compound words
Please respond to: java-user@lucene.apache.org




Hi,

How do people typically analyze/tokenize text with compounds (e.g. German)?
I took a look at GermanAnalyzer hoping to see how one can deal with that,
but it turns out GermanAnalyzer doesn't treat compounds in any special way
at all.

One way to go about this is to have a word dictionary and a tokenizer that
processes input one character at a time, looking for a word match in the
dictionary after each processed character.  Then, CompoundWordLikeThis
could be broken down into multiple tokens/words and returned as a set of
tokens at the same position.  However, somehow this doesn't strike me as a
very smart and fast approach.
What are some better approaches?
If anyone has implemented anything that deals with this problem, I'd love
to hear about it.
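
For illustration, the character-at-a-time idea described above might look
roughly like this in Python. It is a sketch only: scan_subwords and
tokens_at_position are hypothetical names, the dictionaries are toy sets of
lowercase words, and none of this reflects how GermanAnalyzer or any other
Lucene class works.

```python
def scan_subwords(token, dictionary):
    """Scan a token one character at a time and cut at each dictionary match.

    Returns (found_words, leftover), where leftover is whatever trailing text
    never matched a dictionary entry. Hypothetical helper.
    """
    found, buffer = [], ""
    for ch in token.lower():
        buffer += ch
        if buffer in dictionary:  # look for a match after each character
            found.append(buffer)
            buffer = ""
    return found, buffer


def tokens_at_position(position, token, dictionary):
    """Return the original token plus its sub-words, all at the same position."""
    found, _leftover = scan_subwords(token, dictionary)
    return [(position, token)] + [(position, word) for word in found]


print(scan_subwords("CompoundWordLikeThis", {"compound", "word", "like", "this"}))
# (['compound', 'word', 'like', 'this'], '')
print(scan_subwords("Ballett", {"ball", "ballett"}))
# (['ball'], 'ett')
```

The second call shows one reason the approach is not very smart: the scan
commits to the first (shortest) match, so "Ballett" is cut at "ball" and
"ett" is left stranded even though "ballett" is in the dictionary.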

Thanks,
Otis


