lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Ian Forsyth" <...@plusfour.org>
Subject RE: HTML parser
Date Fri, 19 Apr 2002 17:40:12 GMT

Are there core classes part of lucene that allow one to feed lucene links,
and 'it' will capture the contents of those urls into the index..

or does one write a file capture class to seek out the url store the file in
a directory, then index the local directory..

Ian


-----Original Message-----
From: Terence Parr [mailto:parrt@jguru.com]
Sent: Friday, April 19, 2002 1:38 AM
To: Lucene Users List
Subject: Re: HTML parser



On Thursday, April 18, 2002, at 10:28  PM, Otis Gospodnetic wrote:

:snip

Hi Otis,

I have an HTML parser built for ANTLR, but it's pretty strict in what it
accepts.  Not sure how useful it will be for you, but here it is:

http://www.antlr.org/grammars/HTML

I am not sure what your goal is, but I personally have to scarf all
sorts of HTML from various websites to such them into the jGuru search
engine.  I use a simple stripHTML() method I wrote to handle it.  Works
great.  Kills everything but the text.  is that the kind of thing you
are looking for or do you really want to parse not filter?

Terence
--
Co-founder, http://www.jguru.com
Creator, ANTLR Parser Generator: http://www.antlr.org


--
To unsubscribe, e-mail:
<mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail:
<mailto:lucene-user-help@jakarta.apache.org>



--
To unsubscribe, e-mail:   <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>


Mime
View raw message