lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Erik Hatcher <li...@ehatchersolutions.com>
Subject Re: HTML Analyzer?
Date Fri, 15 Nov 2002 00:01:22 GMT
If you have a look at the HtmlDocument class in the ant contributions 
directory of jakarta-lucene-sandbox in Jakarta's CVS.

<http://cvs.apache.org/viewcvs.cgi/jakarta-lucene-sandbox/contributions/ant/src/main/org/apache/lucene/ant/HtmlDocument.java?annotate=1.1>

I wrote this and it uses JTidy to parse HTML and does a nice job of it.

Maybe this would be good for your solution as well?

	Erik


Lichty, Kent wrote:
> We have a web application that builds pages "on the fly" by reading directly
> from a database. The database contains both normal content and HTML.  We use
> Lucene as our search engine, but I need to figure out how to cause it to NOT
> include content that is within HTML tags. I assume that this entails the
> creation of a custom Analyzer.  Are there any existing Analyzers already out
> there that work like this? Thanks!
> 
> 
> 
> ----------  Internet E-mail Confidentiality Disclaimer  ----------
> 
> PRIVILEGED / CONFIDENTIAL INFORMATION may be contained in this message.  If
> you are not the addressee indicated in this message or the employee or agent
> responsible for delivering it to the addressee, you are hereby on notice
> that you are in possession of confidential and privileged information.  Any
> dissemination, distribution, or copying of this e-mail is strictly
> prohibited.  In such case, you should destroy this message and kindly notify
> the sender by reply e-mail.  Please advise immediately if you or your
> employer do not consent to Internet email for messages of this kind.
> 
> Opinions, conclusions, and other information in this message that do not
> relate to the official business of my firm shall be understood as neither
> given nor endorsed by it.
> 
> 
> 
> --
> To unsubscribe, e-mail:   <mailto:lucene-user-unsubscribe@jakarta.apache.org>
> For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>
> 
> 
> 


--
To unsubscribe, e-mail:   <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>


Mime
View raw message