lucene-java-user mailing list archives

From Otis Gospodnetic <otis_gospodne...@yahoo.com>
Subject Re: HTML parser
Date Fri, 19 Apr 2002 06:05:38 GMT
Hello Terence,

Ah, you got me.
I guess I need a bit of both.
I need to just strip HTML and get raw body text so that I can stick it
in Lucene's index.
I would also like something that can extract at least the
<title>...</title> stuff, so that I can stick that in a separate field
in Lucene index.
While doing that I, like you, need to be able to handle poorly
formatted web pages.
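
To make the <title> requirement concrete, here is a minimal sketch of
pulling the title out of a possibly sloppy page. It is only an
illustration: the class and method names (TitleExtractor, extractTitle)
are made up for this example, and it relies on a regular expression
rather than a real parser. It tolerates upper-case tags, attributes on
the tag, and a missing </title>, but nothing more than that.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene or any parser mentioned in this thread.
final class TitleExtractor {

    // Matches an opening <title ...> tag in any case, then captures
    // everything up to the next '<' or the end of the input. Stopping at
    // the next '<' means a missing </title> does not swallow the page.
    private static final Pattern TITLE =
        Pattern.compile("<title\\b[^>]*>([^<]*)", Pattern.CASE_INSENSITIVE);

    // Returns the title text with whitespace collapsed, or "" if none is found.
    static String extractTitle(String html) {
        Matcher m = TITLE.matcher(html);
        if (!m.find()) {
            return "";
        }
        return m.group(1).replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String page = "<html><head><TITLE>HTML\n  parser</TITLE></head><body>x</body></html>";
        System.out.println(extractTitle(page));
    }
}
```

The returned string could then go into its own field of the index,
separate from the body text. Character entities such as &amp; are left
undecoded in this sketch.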

In the future I may need something that has the ability to extract HREFs,
but I'll stick to one of the XP principles and just look for something
that meets current needs :)

I looked for an ANTLR-based HTML parser a few days ago, but must have
missed the one you pointed out.  I'll take a look at it now.
Can you share or describe your stripHTML method?  Is it simple Java that
looks for <s and >s, or something smarter?
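
By "simple Java that looks for <s and >s" I mean something along the
lines of the sketch below. This is a guess at the naive approach, not
Terence's actual stripHTML() method, which is not shown in this thread;
the names NaiveHtmlStripper and stripHtml are invented for the example.
It drops everything between a '<' and the next '>' and keeps the rest.

```java
// Hypothetical illustration of the naive "<" / ">" scanning approach.
final class NaiveHtmlStripper {

    // Copies every character that is outside a tag. Each closed tag is
    // replaced by a space so that words on either side do not run
    // together, and runs of whitespace are collapsed at the end.
    static String stripHtml(String html) {
        StringBuilder out = new StringBuilder(html.length());
        boolean inTag = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') {
                inTag = true;            // start skipping
            } else if (c == '>' && inTag) {
                inTag = false;           // tag finished, resume copying
                out.append(' ');
            } else if (!inTag) {
                out.append(c);           // ordinary text
            }
        }
        return out.toString().replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        System.out.println(stripHtml("<p>Hello <b>world</b></p>"));
    }
}
```

A scanner this simple survives unclosed tags at the end of a page, but
it keeps the contents of <script> and <style> elements, does not decode
entities, and is confused by a '>' inside a comment or attribute value.
Those are the cases where "something smarter" would be needed.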

Thanks,
Otis
P.S.
This type of thing makes me wish I could use Perl or Python :)


--- Terence Parr <parrt@jguru.com> wrote:
> 
> On Thursday, April 18, 2002, at 10:28  PM, Otis Gospodnetic wrote:
> 
> > Hello,
> >
> > I need to select an HTML parser for the application that I'm writing
> > and I'm not sure what to choose.
> > The HTML parser included with Lucene looks flimsy, JTidy looks like a
> > hack and an overkill, using classes written for Swing
> > (javax.swing.text.html.parser) seems wrong, and I haven't tried David
> > McNicol's parser (included with Spindle).
> >
> > Somebody on this list must have done some research on this subject.
> > Can anyone share some experiences?
> > Have you found a better HTML parser than any of those I listed above?
> > If your application deals with HTML, what do you use for parsing it?
> 
> Hi Otis,
> 
> I have an HTML parser built for ANTLR, but it's pretty strict in what it
> accepts.  Not sure how useful it will be for you, but here it is:
> 
> http://www.antlr.org/grammars/HTML
> 
> I am not sure what your goal is, but I personally have to scarf all
> sorts of HTML from various websites to suck them into the jGuru search
> engine.  I use a simple stripHTML() method I wrote to handle it.  Works
> great.  Kills everything but the text.  Is that the kind of thing you
> are looking for, or do you really want to parse, not filter?
> 
> Terence
> --
> Co-founder, http://www.jguru.com
> Creator, ANTLR Parser Generator: http://www.antlr.org
> 
> 
> --
> To unsubscribe, e-mail:  
> <mailto:lucene-user-unsubscribe@jakarta.apache.org>
> For additional commands, e-mail:
> <mailto:lucene-user-help@jakarta.apache.org>
> 

