lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Andy Roberts <m...@andy-roberts.net>
Subject Re: Hypenated word
Date Mon, 13 Jun 2005 16:31:41 GMT
On Monday 13 Jun 2005 14:52, Markus Wiederkehr wrote:
> On 6/13/05, Andy Roberts <mail@andy-roberts.net> wrote:
> > On Monday 13 Jun 2005 13:18, Markus Wiederkehr wrote:
> > > I see, the list of exceptions makes this a lot more complicated than I
> > > thought... Thanks a lot, Erik!
> >
> > I expect you'll need to do some pre-processing. Read in your text into a
> > buffer, line-by-line. If a given line ends with a hyphen, you can
> > manipulate the buffer to merge the hyphenated tokens.
>
> As Erik wrote it is not that simple, unfortunately. For example, if
> one line ends with "read-" and the next line begins with "only" the
> correct word is "read-only" not "readonly". Whereas "work-" and "ing"
> should of course be merged into "working".
>
> Markus

Perhaps you do some crude checking against a dictionary. Combine the word 
anyway and check if it's in the dictionary. If so, keep it merged otherwise, 
it's a compound and so revert back to the hyphenated form.

Word lists come part of all good OSS dictionary projects, as well as other 
language resources, like the BNC word lists etc.

Andy

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message