Chong, Herb wrote:
> can you share a description of the heuristics you used to clean up the text? i am facing
the same problem right now handling email. i'm not interested in the rules you use as much
as the tools you use to implement the rules.
The tools... well, Java ;-)
The search engine is a custom Java application, which uses Lucene. The
heuristics are not very general at this point, they are tailored to our
domain. So what you are hinting at (a generic rules description language
to customize to the local domain) seems appropriate. Our rules are
things like "anything within <h1>...</h1> is an important sentence and
we add a full-stop at the end".
Ulrich
---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org
|