lucene-java-user mailing list archives

From "Binkley, Peter" <>
Subject RE: Analysis/tokenization of compound words
Date Thu, 21 Sep 2006 18:38:18 GMT
Aspell has some support for compound words that might be useful to look at.


Peter Binkley
Digital Initiatives Technology Librarian
Information Technology Services
4-30 Cameron Library
University of Alberta Libraries
Edmonton, Alberta
Canada T6G 2J8
Phone: (780) 492-3743
Fax: (780) 492-9243


-----Original Message-----
From: Otis Gospodnetic [] 
Sent: Tuesday, September 19, 2006 10:22 AM
Subject: Analysis/tokenization of compound words


How do people typically analyze/tokenize text with compounds (e.g.
German)?  I took a look at GermanAnalyzer hoping to see how one can deal
with that, but it turns out GermanAnalyzer doesn't treat compounds in
any special way at all.

One way to go about this is to have a word dictionary and a tokenizer
that processes input one character at a time, looking for a word match
in the dictionary after each processed character.  Then,
CompoundWordLikeThis could be broken down into multiple tokens/words and
returned as a set of tokens at the same position.  However, somehow this
doesn't strike me as a very smart and fast approach.
What are some better approaches?
If anyone has implemented anything that deals with this problem, I'd
love to hear about it.
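[For illustration, here is a minimal sketch of the dictionary-lookup idea described above: scan the compound left to right and greedily take the longest dictionary word at each position. The class name, method name, and toy dictionary are all hypothetical, not part of Lucene or any existing analyzer; a real filter would also emit the subtokens with a position increment of 0 so they share the original token's position.]

```java
import java.util.*;

public class CompoundSplitter {

    // Toy dictionary for illustration; a real implementation would load
    // a full word list for the target language (e.g. German).
    private static final Set<String> DICTIONARY =
        new HashSet<>(Arrays.asList("compound", "word", "like", "this"));

    // Greedy longest-match decomposition. Returns the original token
    // unchanged if it cannot be fully decomposed into dictionary words.
    public static List<String> splitCompound(String token) {
        String lower = token.toLowerCase();
        List<String> parts = new ArrayList<>();
        int pos = 0;
        while (pos < lower.length()) {
            int end = -1;
            // Try the longest possible match starting at pos first.
            for (int i = lower.length(); i > pos; i--) {
                if (DICTIONARY.contains(lower.substring(pos, i))) {
                    end = i;
                    break;
                }
            }
            if (end < 0) {
                // No dictionary word matches here: keep the token whole.
                return Collections.singletonList(token);
            }
            parts.add(lower.substring(pos, end));
            pos = end;
        }
        return parts;
    }

    public static void main(String[] args) {
        // prints [compound, word, like, this]
        System.out.println(splitCompound("CompoundWordLikeThis"));
    }
}
```

Note that greedy longest-match can mis-split some real compounds; production splitters usually score alternative decompositions rather than committing to the first match.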


To unsubscribe, e-mail:
For additional commands, e-mail:

