Date: Wed, 2 Feb 2005 19:03:20 +0100 (MET)
From: "Karl Koch"
To: "Lucene Users List"
Subject: Re: which HTML parser is better?
Message-ID: <5164.1107367400@www32.gmx.net>
References: <4200E39F.5000208@ifit.uni-klu.ac.at>

I am in control of the HTML, which means it is well-formatted HTML. I use only
HTML files which I have transformed from XML. No external HTML (e.g.
the web). Are there any very short solutions for that?

Karl

> Karl Koch wrote:
>
> >Hi,
> >
> >yes, but the library you are using is quite big. I was thinking that 5 kB
> >of code could actually do that. That SourceForge project does much more
> >than that, but I do not need it.
> >
> you need just the htmlparser.jar, 200k.
> ... you know ... the functionality is strongly correlated with the size.
>
> You can use 3 lines of code with a good regular expression to eliminate
> the HTML tags, but this won't give you any guarantee that the text from
> badly formatted HTML files will be correctly extracted...
>
> Best,
>
> Sergiu
>
> >Karl
> >
> >> Hi Karl,
> >>
> >> I already submitted a piece of code that removes the HTML tags.
> >> Search for my previous answer in this thread.
> >>
> >> Best,
> >>
> >> Sergiu
> >>
> >>Karl Koch wrote:
> >>
> >>>Hello,
> >>>
> >>>I have been following this thread and have another question.
> >>>
> >>>Is there a piece of source code (preferably very short and simple
> >>>(KISS)) that removes all HTML tags from HTML content? HTML 3.2
> >>>would be enough... also no frames, CSS, etc.
> >>>
> >>>I do not need the HTML structure tree or any other structure, but I
> >>>need a facility to clean up HTML into its normal underlying content
> >>>before indexing that content as a whole.
> >>>
> >>>Karl
> >>>
> >>>>I think that depends on what you want to do. The Lucene demo parser
> >>>>does simple mapping of HTML files into Lucene Documents; it does not
> >>>>give you a parse tree for the HTML doc.
> >>>>CyberNeko is an extension of Xerces (uses the same API; will likely
> >>>>become part of Xerces), and so maps an HTML document into a full DOM
> >>>>that you can manipulate easily for a wide range of purposes. I haven't
> >>>>used JTidy at an API level and so don't know it as well -- based on its
> >>>>UI, it appears to be focused primarily on HTML validation and error
> >>>>detection/correction.
> >>>>
> >>>>I use CyberNeko for a range of operations on HTML documents that go
> >>>>beyond indexing them in Lucene, and really like it. It has been robust
> >>>>for me so far.
> >>>>
> >>>>Chuck
> >>>>
> >>>> > -----Original Message-----
> >>>> > From: Jingkang Zhang [mailto:zjingk@yahoo.com.cn]
> >>>> > Sent: Tuesday, February 01, 2005 1:15 AM
> >>>> > To: lucene-user@jakarta.apache.org
> >>>> > Subject: which HTML parser is better?
> >>>> >
> >>>> > Three HTML parsers (Lucene web application demo, CyberNeko HTML
> >>>> > Parser, JTidy) are mentioned in Lucene FAQ 1.3.27. Which is the
> >>>> > best? Can it filter tags that are auto-created by the MS Word
> >>>> > 'Save As HTML files' function?
---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org
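
For the case raised in this thread (HTML that is generated from XML and therefore under the author's control), the "few lines with a regular expression" idea might look roughly like the sketch below. It is not the code posted earlier in the thread, and it is not taken from any of the parsers mentioned; the class name `HtmlStripper` and the method `stripTags` are made up for illustration. It assumes that no `>` character appears inside attribute values and that only a handful of named entities occur, so it gives no guarantees for arbitrary HTML from the web.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; assumes well-formatted HTML.
class HtmlStripper {
    // HTML comments, which may span several lines.
    private static final Pattern COMMENTS = Pattern.compile("<!--.*?-->", Pattern.DOTALL);
    // Anything that looks like a tag; assumes no '>' inside attribute values.
    private static final Pattern TAGS = Pattern.compile("<[^>]*>");
    // Runs of whitespace left behind after the tags are removed.
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    static String stripTags(String html) {
        String text = COMMENTS.matcher(html).replaceAll(" ");
        // Replace each tag with a space so words in adjacent elements do not merge.
        // Side effect: a tag in the middle of a word splits that word in two.
        text = TAGS.matcher(text).replaceAll(" ");
        // Decode only a few common entities; "&amp;" goes last to avoid double decoding.
        text = text.replace("&nbsp;", " ")
                   .replace("&lt;", "<")
                   .replace("&gt;", ">")
                   .replace("&quot;", "\"")
                   .replace("&amp;", "&");
        return WHITESPACE.matcher(text).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        String html = "<html><body><h1>Title</h1><p>Some <b>bold</b> text.</p></body></html>";
        System.out.println(stripTags(html)); // prints "Title Some bold text."
    }
}
```

The returned string can then be indexed as a single field. For HTML that is not under your control, a real parser such as the ones discussed above remains the safer choice.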