From: Erik Hatcher
Subject: Re: which HTML parser is better?
Date: Wed, 2 Feb 2005 08:22:30 -0500
To: "Lucene Users List"

On Feb 2, 2005, at 6:17 AM, Karl Koch wrote:

> Hello,
>
> I have been following this thread and have another question.
>
> Is there a piece of source code (preferably very short and simple
> (KISS)) that removes all HTML tags from HTML content? HTML 3.2
> would be enough; no frames, CSS, etc.
>
> I do not need the HTML structure tree or any other structure; I only
> need a facility to clean up HTML into its underlying content before
> indexing that content as a whole.

The code in the Lucene Sandbox for parsing HTML with JTidy (under
contributions/ant) for the task does what you ask.

	Erik
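
For illustration, here is a minimal sketch of the kind of tag stripping asked about above. It is not the Sandbox's JTidy code. To stay dependency-free, it swaps in the HTML parser bundled with the JDK (javax.swing.text.html.parser.ParserDelegator). The class name HtmlTextExtractor and the method stripTags are hypothetical names chosen for this example.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper, not part of Lucene or JTidy.
final class HtmlTextExtractor {

    private HtmlTextExtractor() {}

    /** Returns the text content of the given HTML with the tags dropped. */
    static String stripTags(String html) {
        final StringBuilder text = new StringBuilder();

        // The parser reports each run of text through handleText; tag
        // callbacks are left at their no-op defaults, so tags are discarded.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Separate runs with a space so that adjacent blocks
                // (for example two paragraphs) do not fuse into one word.
                text.append(data).append(' ');
            }
        };

        try (Reader reader = new StringReader(html)) {
            // Third argument true: ignore charset declarations in the markup,
            // since the input is already a decoded String.
            new ParserDelegator().parse(reader, callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }

        // Collapse whitespace runs left over from the markup layout.
        return text.toString().replaceAll("\\s+", " ").trim();
    }
}
```

The returned string can then be indexed as a whole. One trade-off of the space separator: a word interrupted by inline markup (such as a single bold letter) comes out split in two, which is usually acceptable for indexing but not for display.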