lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Luke Shannon" <lshan...@futurebrand.com>
Subject Re: which HTML parser is better?
Date Wed, 02 Feb 2005 19:58:36 GMT
In our application I use regular expressions to strip all tags in one
situation and specific ones in another situation. Here is sample code for
both:

This strips all html 4.0 tags except <p>, <ul>, <br>, <li>, <strong>,
<em>,
<u>:

html_source =
Pattern.compile("</?\\s?(A|ABBR|ACRONYM|ADDRESS|APPLET|AREA|B|BASE|BASEFONT|
BDO|BIG|BLOCKQUOTE|BODY|BUTTON|CAPTION|CENTER|CITE|CODE|COL|COLGROUP|DD|DEL|
DFN|DIR|DIV|DL|DT|FIELDSET|FONT|FORM|FRAME|FRAMESET|H1|H2|H3|H4|H5|H6|HEAD|H
R|HTML|I|IFRAME|IMG|INPUT|INS|ISINDEX|KBD|LABEL|LEGEND|LINK|MAP|MENU|META|NO
FRAMES|NOSCRIPT|OBJECT|OL|OPTGROUP|OPTION|PARAM|PRE|Q|S|SAMP|SCRIPT|SELECT|S
MALL|SPAN|STRIKE|STYLE|SUB|SUP|TABLE|TBODY|TD|TEXTAREA|TFOOT|TH|THEAD|TITLE|
TR|TT|VAR)(.|\n)*?\\s?>",
Pattern.CASE_INSENSITIVE).matcher(html_source).replaceAll("");

When I want to strip anything in a tag I use the following pattern with the
code above:

String strPattern1 = "<\\s?(.|\n)*?\\s?>";

HTH

Luke



----- Original Message ----- 
From: "sergiu gordea" <gsergiu@ifit.uni-klu.ac.at>
To: "Lucene Users List" <lucene-user@jakarta.apache.org>
Sent: Wednesday, February 02, 2005 1:23 PM
Subject: Re: which HTML parser is better?


> Karl Koch wrote:
>
> >I am in control of the html, which means it is well formated HTML. I use
> >only HTML files which I have transformed from XML. No external HTML (e.g.
> >the web).
> >
> >Are there any very-short solutions for that?
> >
> >
> if you are using only correct formated HTML pages and you are in control
> of these pages.
> you can use a regular exprestion to remove the tags.
>
> something like
> replaceAll("<*>","");
>
> This is the ideea behind the operation. If you will search on google you
> will find a more robust
> regular expression.
>
> Using a simple regular expression will be a very cheap solution, that
> can cause you a lot of problems in the future.
>
>  It's up to you to use it ....
>
>  Best,
>
>  Sergiu
>
> >Karl
> >
> >
> >
> >>Karl Koch wrote:
> >>
> >>
> >>
> >>>Hi,
> >>>
> >>>yes, but the library your are using is quite big. I was thinking that a
> >>>
> >>>
> >>5kB
> >>
> >>
> >>>code could actually do that. That sourceforge project is doing much
more
> >>>than that but I do not need it.
> >>>
> >>>
> >>>
> >>>
> >>you need just the htmlparser.jar 200k.
> >>... you know ... the functionality is strongly correclated with the
size.
> >>
> >>  You can use 3 lines of code with a good regular expresion to eliminate
> >>the html tags,
> >>but this won't give you any guarantie that the text from the bad
> >>fromated html files will be
> >>correctly extracted...
> >>
> >>  Best,
> >>
> >>  Sergiu
> >>
> >>
> >>
> >>>Karl
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>> Hi Karl,
> >>>>
> >>>>I already submitted a peace of code that removes the html tags.
> >>>>Search for my previous answer in this thread.
> >>>>
> >>>> Best,
> >>>>
> >>>>  Sergiu
> >>>>
> >>>>Karl Koch wrote:
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>>Hello,
> >>>>>
> >>>>>I have  been following this thread and have another question.
> >>>>>
> >>>>>Is there a piece of sourcecode (which is preferably very short and
> >>>>>
> >>>>>
> >>simple
> >>
> >>
> >>>>>(KISS)) which allows to remove all HTML tags from HTML content? HTML
> >>>>>
> >>>>>
> >>3.2
> >>
> >>
> >>>>>would be enough...also no frames, CSS, etc.
> >>>>>
> >>>>>I do not need to have the HTML strucutre tree or any other structure
> >>>>>
> >>>>>
> >>but
> >>
> >>
> >>>>>need a facility to clean up HTML into its normal underlying content
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>before
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>>indexing that content as a whole.
> >>>>>
> >>>>>Karl
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>>I think that depends on what you want to do.  The Lucene demo
parser
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>does
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>>>simple mapping of HTML files into Lucene Documents; it does not
give
> >>>>>>
> >>>>>>
> >>you
> >>
> >>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>a
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>>>parse tree for the HTML doc.  CyberNeko is an extension of Xerces
> >>>>>>
> >>>>>>
> >>(uses
> >>
> >>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>the
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>>same API; will likely become part of Xerces), and so maps an
HTML
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>document
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>>>into a full DOM that you can manipulate easily for a wide range
of
> >>>>>>purposes.  I haven't used JTidy at an API level and so don't
know it
> >>>>>>
> >>>>>>
> >>as
> >>
> >>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>well --
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>>based on its UI, it appears to be focused primarily on HTML
validation
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>and
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>>>error detection/correction.
> >>>>>>
> >>>>>>I use CyberNeko for a range of operations on HTML documents that
go
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>beyond
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>>>indexing them in Lucene, and really like it.  It has been robust
for
> >>>>>>
> >>>>>>
> >>me
> >>
> >>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>so
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>>>far.
> >>>>>>
> >>>>>>Chuck
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>>-----Original Message-----
> >>>>>>>From: Jingkang Zhang [mailto:zjingk@yahoo.com.cn]
> >>>>>>>Sent: Tuesday, February 01, 2005 1:15 AM
> >>>>>>>To: lucene-user@jakarta.apache.org
> >>>>>>>Subject: which HTML parser is better?
> >>>>>>>
> >>>>>>>Three HTML parsers(Lucene web application
> >>>>>>>demo,CyberNeko HTML Parser,JTidy) are mentioned in
> >>>>>>>Lucene FAQ
> >>>>>>>1.3.27.Which is the best?Can it filter tags that are
> >>>>>>>auto-created by MS-word 'Save As HTML files' function?
> >>>>>>>
> >>>>>>>_________________________________________________________
> >>>>>>>Do You Yahoo!?
> >>>>>>>150ÍòÇúMP3·è¿ñËÑ£¬´øÄú´³ÈëÒôÀÖµîÌÃ
> >>>>>>>http://music.yisou.com/
> >>>>>>>ÃÀÅ®Ã÷ÐÇÓ¦Óо¡ÓУ¬ËѱéÃÀͼ¡¢ÑÞͼºÍ¿áͼ
> >>>>>>>http://image.yisou.com
> >>>>>>>1G¾ÍÊÇ1000Õ×£¬ÑÅ»¢µçÓÊ×ÔÖúÀ©ÈÝ£¡
> >>>>>>>
> >>>>>>>
> >>>>>>>
>
>>>>>http://cn.rd.yahoo.com/mail_cn/tag/1g/*http://cn.mail.yahoo.com/event/m
a
> >>>>>
> >>>>>
> >>>>>>>il_1g/
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>---------------------------------------------------------------------
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>>>>To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> >>>>>>>For additional commands, e-mail:
> >>>>>>>
> >>>>>>>
> >>lucene-user-help@jakarta.apache.org
> >>
> >>
>
>>>>>>---------------------------------------------------------------------
> >>>>>>To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> >>>>>>For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>---------------------------------------------------------------------
> >>>>To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> >>>>For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>
> >>>
> >>>
> >>>
> >>---------------------------------------------------------------------
> >>To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> >>For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> >>
> >>
> >>
> >
> >
> >
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>
>



---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Mime
View raw message