lucene-java-user mailing list archives

From "Peter A. Friend"
Subject Re: Hyphenated word
Date Mon, 13 Jun 2005 15:22:44 GMT

On Jun 13, 2005, at 6:18 AM, Markus Wiederkehr wrote:

> I see, the list of exceptions makes this a lot more complicated than I
> thought... Thanks a lot, Erik!

There is a section about the problems that hyphens create in
"Foundations of Statistical Natural Language Processing". Not only
are the cases numerous, but seemingly simple rules, such as joining
hyphenated forms at the ends of lines, do not always work. Sometimes
the hyphen was added only to break the word across lines; sometimes
you are dealing with a genuinely hyphenated form that just happened
to fall at the end of a line, so the hyphen serves two purposes.
I've toyed with the idea of indexing hyphenated words in their raw
as well as split forms, but I think that would wreak havoc on the
word positions, as well as bloat the index with potentially
meaningless gibberish.
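
A rough sketch of the raw-plus-split idea may make the position and bloat concerns concrete. This is plain Python, not Lucene code: the function name `tokenize_with_hyphen_variants`, the token pattern, and the choice to give the raw form the same position as its first part are all assumptions made for illustration. (Lucene expresses this kind of stacking through a token's position increment, but nothing below uses its API.)

```python
import re

# Hypothetical token pattern: runs of letters/digits, optionally joined by
# single hyphens. A real analyzer would need far more careful rules.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*")


def tokenize_with_hyphen_variants(text):
    """Return (term, position) pairs, indexing hyphenated words raw and split."""
    postings = []
    position = 0
    for match in TOKEN_RE.finditer(text.lower()):
        raw = match.group(0)
        parts = raw.split("-")
        if len(parts) > 1:
            # Assumption: the raw form is stacked on the position of its
            # first part instead of getting a position of its own.
            postings.append((raw, position))
        for offset, part in enumerate(parts):
            postings.append((part, position + offset))
        # Each split part consumes one position in the stream.
        position += len(parts)
    return postings


if __name__ == "__main__":
    for term, pos in tokenize_with_hyphen_variants("A well-known e-mail list"):
        print(pos, term)
```

For the four-word input in the usage example this yields eight postings, including the nearly meaningless term "e". The raw forms "well-known" and "e-mail" land at positions 1 and 3, so they are no longer adjacent for phrase matching even though they are neighbours in the text.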
