lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Terence Parr <>
Subject Re: HTML parser
Date Fri, 19 Apr 2002 17:21:30 GMT
Hi Otis,

The idea behind stripHTML is pretty simple.  It's just a hand-built 
lexer that looks like this:

while more char
	if comment start, scarf til end comment
	if char is < then
		if SCRIPT tag scarf til end SCRIPT;
			[same with A, STYLE, HEAD, FORM etc...]
	if &blort; and not stuff lile LT, GT, AMP, QUOT, NBSP, scarf
	if whitespace scarf leaving just one bit of whitespace
	otherwise just add char to stripped text.

Nothing tricky in the code, but I should have used ANTLR itself to build 
this lexer.  it got to be a big method with lots of book keeping code as 
in all lexers.


On Thursday, April 18, 2002, at 11:05  PM, Otis Gospodnetic wrote:

> Hello Terrence,
> Ah, you got me.
> I guess I need a bit of both.
> I need to just strip HTML and get raw body text so that I can stick it
> in Lucene's index.
> I would also like something that can extract at least the
> <title>...</title> stuff, so that I can stick that in a separate field
> in Lucene index.
> While doing that I, like you, need to be able to handle poorly
> formatted web pages.
> In a future I may need something that has the ability to extract HREFs,
> but I'll stick to one of the XP principles and just look for something
> that meets current needs :)
> I looked for ANTLR-based HTML parser a few days ago, but must have
> missed the one you pointed out.  I'll take a look at it now.
> Can you share or describe your stripHTML method?  Simple java that
> looks for <s and >s or something smarter?
> Thanks,
> Otis
> P.S.
> This type of thing makes me wish I can use Perl or Python :)
> --- Terence Parr <> wrote:
>> On Thursday, April 18, 2002, at 10:28  PM, Otis Gospodnetic wrote:
>>> Hello,
>>> I need to select an HTML parser for the application that I'm
>> writing
>>> and I'm not sure what to choose.
>>> The HTML parser included with Lucene looks flimsy, JTidy looks like
>> a
>>> hack and an overkill, using classes written for Swing
>>> (javax.swing.text.html.parser) seems wrong, and I haven't tried
>> David
>>> McNicol's parser (included with Spindle).
>>> Somebody on this list must have done some research on this subject.
>>> Can anyone share some experiences?
>>> Have you found a better HTML parser than any of those I listed
>> above?
>>> If your application deals with HTML, what do you use for parsing
>> it?
>> Hi Otis,
>> I have an HTML parser built for ANTLR, but it's pretty strict in what
>> it
>> accepts.  Not sure how useful it will be for you, but here it is:
>> I am not sure what your goal is, but I personally have to scarf all
>> sorts of HTML from various websites to such them into the jGuru
>> search
>> engine.  I use a simple stripHTML() method I wrote to handle it.
>> Works
>> great.  Kills everything but the text.  is that the kind of thing you
>> are looking for or do you really want to parse not filter?
>> Terence
Creator, ANTLR Parser Generator:
>> --
>> To unsubscribe, e-mail:
>> <>
>> For additional commands, e-mail:
>> <>
> __________________________________________________
> Do You Yahoo!?
> Yahoo! Tax Center - online filing with TurboTax
> --
> To unsubscribe, e-mail:   <mailto:lucene-user-
> For additional commands, e-mail: <mailto:lucene-user-

To unsubscribe, e-mail:   <>
For additional commands, e-mail: <>

View raw message