From: Erik Hatcher
Date: Thu, 14 Nov 2002 19:01:22 -0500
To: Lucene Users List
Subject: Re: HTML Analyzer?

Have a look at the HtmlDocument class in the ant contributions directory of jakarta-lucene-sandbox in Jakarta's CVS. I wrote this, and it uses JTidy to parse HTML and does a nice job of it. Maybe this would be good for your solution as well?

    Erik

Lichty, Kent wrote:
> We have a web application that builds pages "on the fly" by reading directly
> from a database. The database contains both normal content and HTML.
> We use Lucene as our search engine, but I need to figure out how to cause it
> to NOT include content that is within HTML tags. I assume that this entails
> the creation of a custom Analyzer. Are there any existing Analyzers already
> out there that work like this? Thanks!
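
For anyone who wants to try the idea without JTidy: the suggestion above amounts to parsing the HTML and keeping only its text before it reaches Lucene, rather than writing a custom Analyzer. Below is a rough sketch of that step using only the HTML parser that ships with the JDK (Swing's ParserDelegator). It is not the HtmlDocument code from the sandbox, and the class name HtmlText and the method extract are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper (not part of Lucene or JTidy): returns the text of an
// HTML fragment with the markup dropped, ready to be indexed as plain text.
final class HtmlText {

    static String extract(String html) {
        final StringBuilder text = new StringBuilder();

        // The parser reports each run of character data through handleText;
        // tags and attributes arrive through other callbacks we do not override.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Separate runs with a space so words on either side of a tag
                // do not merge. Trade-off: a word split by inline markup
                // (e.g. "<b>Luc</b>ene") ends up as two tokens.
                text.append(data).append(' ');
            }
        };

        try {
            // Third argument true: ignore any charset declared in the markup,
            // since we are already reading decoded characters from a String.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            // A StringReader should not fail, but parse() declares IOException.
            throw new UncheckedIOException(e);
        }

        // Collapse the whitespace introduced above and by the source markup.
        return text.toString().replaceAll("\\s+", " ").trim();
    }
}
```

Calling HtmlText.extract("<p>Hello <b>Lucene</b> users</p>") should give back the words without the surrounding tags; that string is what you would then hand to Lucene as the field content. JTidy is more forgiving of badly formed pages, so treat this only as a starting point.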