From: "Karl Koch"
To: "Lucene Users List"
Date: Wed, 2 Feb 2005 15:22:52 +0100 (MET)
Subject: Re: which HTML parser is better?

Hi,

yes, but the library you are using is quite big. I was thinking that about
5 kB of code could actually do that. That SourceForge project does much more
than that, but I do not need it.

Karl

> Hi Karl,
>
> I already submitted a piece of code that removes the HTML tags.
> Search for my previous answer in this thread.
>
> Best,
>
> Sergiu
>
> Karl Koch wrote:
>
> > Hello,
> >
> > I have been following this thread and have another question.
> >
> > Is there a piece of source code (preferably very short and simple
> > (KISS)) that removes all HTML tags from HTML content? HTML 3.2
> > would be enough, with no frames, CSS, etc.
> >
> > I do not need the HTML structure tree or any other structure, but I
> > need a facility to clean up HTML into its underlying content before
> > indexing that content as a whole.
> >
> > Karl
> >
> > > I think that depends on what you want to do. The Lucene demo parser
> > > does simple mapping of HTML files into Lucene Documents; it does not
> > > give you a parse tree for the HTML doc. CyberNeko is an extension of
> > > Xerces (uses the same API; will likely become part of Xerces), and so
> > > maps an HTML document into a full DOM that you can manipulate easily
> > > for a wide range of purposes. I haven't used JTidy at an API level and
> > > so don't know it as well; based on its UI, it appears to be focused
> > > primarily on HTML validation and error detection/correction.
> > >
> > > I use CyberNeko for a range of operations on HTML documents that go
> > > beyond indexing them in Lucene, and really like it. It has been robust
> > > for me so far.
> > >
> > > Chuck
> > >
> > > > -----Original Message-----
> > > > From: Jingkang Zhang [mailto:zjingk@yahoo.com.cn]
> > > > Sent: Tuesday, February 01, 2005 1:15 AM
> > > > To: lucene-user@jakarta.apache.org
> > > > Subject: which HTML parser is better?
> > > >
> > > > Three HTML parsers (Lucene web application demo, CyberNeko HTML
> > > > Parser, JTidy) are mentioned in Lucene FAQ 1.3.27. Which is the
> > > > best? Can it filter tags that are auto-created by the MS Word
> > > > 'Save As HTML files' function?
> > > >
> > > > ---------------------------------------------------------------------
> > > > To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> > > > For additional commands, e-mail: lucene-user-help@jakarta.apache.org
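
For illustration, the kind of very small tag stripper Karl asks about could look roughly like the sketch below. This is not Sergiu's code and not part of Lucene, CyberNeko, or JTidy; the class name HtmlTagStripper and the method strip are made up here. It assumes simple HTML 3.2-style input where no '>' appears inside attribute values, it decodes only a handful of common entities, and it does not drop the contents of script or style elements.

```java
/**
 * Hypothetical minimal tag stripper (illustration only, not from the thread).
 * Assumes simple, mostly well-formed HTML with no '>' inside attribute values.
 */
final class HtmlTagStripper {

    private HtmlTagStripper() {}

    /** Returns the text of html with tags and comments removed. */
    public static String strip(String html) {
        StringBuilder out = new StringBuilder(html.length());
        int n = html.length();
        int i = 0;
        while (i < n) {
            char c = html.charAt(i);
            if (c != '<') {
                // Plain text: copy it through unchanged.
                out.append(c);
                i++;
                continue;
            }
            if (html.startsWith("<!--", i)) {
                // Comment: skip to the closing "-->" (or to the end if unterminated).
                int end = html.indexOf("-->", i + 4);
                i = (end < 0) ? n : end + 3;
            } else {
                // Ordinary tag: skip to the next '>' (or to the end if unterminated).
                int end = html.indexOf('>', i + 1);
                i = (end < 0) ? n : end + 1;
            }
            // Leave a space where the tag was so adjacent words stay separate.
            out.append(' ');
        }
        // Decode a few common entities; "&amp;" goes last so that
        // "&amp;lt;" becomes the literal text "&lt;" rather than "<".
        String text = out.toString()
                .replace("&nbsp;", " ")
                .replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&quot;", "\"")
                .replace("&amp;", "&");
        // Collapse runs of whitespace left behind by removed tags.
        return text.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String html = "<html><body><h1>Title</h1>"
                + "<p>Some <b>bold</b> text &amp; more.</p></body></html>";
        System.out.println(strip(html)); // prints: Title Some bold text & more.
    }
}
```

The returned string could then be indexed as a single field. Because every tag is replaced by a space, punctuation that directly follows a closing tag ends up preceded by a space, which usually does not matter for indexing. For messy input such as MS Word exports, a real parser like the ones discussed in this thread is likely the safer choice.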