lucene-java-user mailing list archives

From Ulrich Mayring
Subject Re: New Lucene-powered Website
Date Mon, 01 Dec 2003 14:13:27 GMT
Chong, Herb wrote:
> can you share a description of the heuristics you used to clean up the text? i am facing
> the same problem right now handling email. i'm not interested in the rules you use as much
> as the tools you use to implement the rules.

The tools... well, Java ;-)

The search engine is a custom Java application that uses Lucene. The
heuristics are not very general at this point; they are tailored to our
domain. So what you are hinting at (a generic rules description language
that can be customized to the local domain) seems appropriate. Our rules are
things like "anything within <h1>...</h1> is an important sentence, and
we add a full stop at the end".
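
To make the <h1> example concrete, here is a minimal sketch of what such a rule might look like in plain Java. It is not the code from the application described above: the class name HeadingRule, the method apply, and the decision to skip headings that already end in ".", "!" or "?" are all assumptions made for illustration. It only covers the full-stop part of the rule, not how a heading is marked as important.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: names and behavior are illustrative assumptions,
// not taken from the application described in this message.
class HeadingRule {
    // Matches <h1 ...>...</h1>, ignoring case and allowing line breaks inside.
    private static final Pattern H1 =
        Pattern.compile("<h1\\b[^>]*>(.*?)</h1>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Assumption: a heading that already ends in one of these needs no extra full stop.
    private static final String TERMINATORS = ".!?";

    // Replaces each <h1> element with its text, terminated like a sentence.
    static String apply(String html) {
        Matcher m = H1.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String heading = m.group(1).trim();
            boolean terminated = !heading.isEmpty()
                && TERMINATORS.indexOf(heading.charAt(heading.length() - 1)) >= 0;
            if (!heading.isEmpty() && !terminated) {
                heading = heading + ".";
            }
            // quoteReplacement keeps '$' and '\' in the heading text literal.
            m.appendReplacement(out, Matcher.quoteReplacement(heading + " "));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(apply("<h1>Product News</h1><p>We ship today.</p>"));
    }
}
```

A regular expression is enough for a sketch like this, but it will not cope with nested or malformed markup; a real HTML parser would be the more robust choice for text that comes from arbitrary pages.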

