Can you share a description of the heuristics you used to clean up the text? I am facing the
same problem right now in handling email. I'm not interested in the rules you use as much as
the tools you use to implement the rules.
Herb....
-----Original Message-----
From: Ulrich Mayring [mailto:ulim@denic.de]
Sent: Friday, November 28, 2003 4:21 AM
To: lucene-user@jakarta.apache.org
Subject: Re: New Lucene-powered Website
This "clean-up work" is actually trickier than the summarising itself
and it is usually very domain-specific. That's the reason why I haven't
proposed to contribute the summariser to Lucene, because the clean-up
code is not generic. The summariser itself is just one class with 300
lines, but without prior clean-up the quality of its summaries is
insufficient.
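
Neither message spells out the actual rules or the tools behind them. As a purely illustrative sketch (not the clean-up code described above), a rule-based pass over email text could be written with plain regular expressions. The patterns and the function name `clean_email` are assumptions made up for this example; real rules would be domain-specific, as noted above.

```python
import re

# Hypothetical patterns, chosen only for illustration.
HEADER_RE = re.compile(r"^(From|Sent|To|Cc|Subject|Date):\s", re.IGNORECASE)
FORWARD_RE = re.compile(r"^-+\s*Original Message\s*-+$", re.IGNORECASE)
FOOTER_RE = re.compile(r"^-{20,}$")  # long dashed rule that opens a list footer


def clean_email(text):
    """Return the message text with headers, separators, quotes and footer removed."""
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        if FOOTER_RE.match(stripped):
            break  # treat everything after the dashed rule as list boilerplate
        if FORWARD_RE.match(stripped) or HEADER_RE.match(stripped):
            continue  # drop forwarded-message separators and header lines
        if stripped.startswith(">"):
            continue  # drop quoted reply text
        kept.append(stripped)
    # Collapse runs of blank lines left behind by the removed lines.
    cleaned = re.sub(r"\n{3,}", "\n\n", "\n".join(kept))
    return cleaned.strip()
```

Each rule is a separate pattern, so rules that do not suit a given mail source can be swapped out without touching the loop.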