From: Otis Gospodnetic
Date: Fri, 19 Apr 2002 10:50:37 -0700 (PDT)
Subject: RE: HTML parser
To: Lucene Users List, ian@plusfour.org

Such classes are not included with Lucene.  This was _just_ mentioned on
this list earlier today.  Look at the archives and search for crawler,
URL, Lucene sandbox, etc.

Otis

--- Ian Forsyth wrote:
>
> Are there core classes in Lucene that allow one to feed Lucene links,
> so that 'it' will capture the contents of those URLs into the index?
>
> Or does one write a file-capture class to fetch each URL, store the
> file in a directory, and then index the local directory?
>
> Ian
>
>
> -----Original Message-----
> From: Terence Parr [mailto:parrt@jguru.com]
> Sent: Friday, April 19, 2002 1:38 AM
> To: Lucene Users List
> Subject: Re: HTML parser
>
>
> On Thursday, April 18, 2002, at 10:28 PM, Otis Gospodnetic wrote:
>
> :snip
>
> Hi Otis,
>
> I have an HTML parser built for ANTLR, but it's pretty strict in what
> it accepts.  Not sure how useful it will be for you, but here it is:
>
> http://www.antlr.org/grammars/HTML
>
> I am not sure what your goal is, but I personally have to scarf all
> sorts of HTML from various websites to suck them into the jGuru search
> engine.  I use a simple stripHTML() method I wrote to handle it.  Works
> great.  Kills everything but the text.  Is that the kind of thing you
> are looking for, or do you really want to parse, not filter?
>
> Terence
> --
> Co-founder, http://www.jguru.com
> Creator, ANTLR Parser Generator: http://www.antlr.org
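
[Editor's note: neither Ian's URL-capture class nor Terence's stripHTML() method
appears in this thread.  The sketch below is only an assumption of what such a
helper might have looked like against the Lucene 1.x API of the time: fetch each
page over HTTP, strip the tags with a crude regex (a filter, not a real parser),
and add the remaining text to an index.  The class name UrlIndexer and the
stripHTML() body are illustrative, not part of Lucene or either poster's code.]

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

/** Illustrative sketch only: fetch a few URLs, strip tags, index the text. */
public class UrlIndexer {

    /** Crude tag filter in the spirit of a stripHTML() method: removes
     *  everything between '<' and '>' but makes no attempt to parse. */
    static String stripHTML(String html) {
        return html.replaceAll("<[^>]*>", " ");
    }

    public static void main(String[] args) throws Exception {
        // true = create a fresh index in the "index" directory
        IndexWriter writer =
            new IndexWriter("index", new StandardAnalyzer(), true);

        for (int i = 0; i < args.length; i++) {
            // Fetch the page contents over HTTP.
            URL url = new URL(args[i]);
            BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream()));
            StringBuffer page = new StringBuffer();
            String line;
            while ((line = in.readLine()) != null) {
                page.append(line).append('\n');
            }
            in.close();

            // Index the stripped text; Field.Text is stored, indexed, tokenized.
            Document doc = new Document();
            doc.add(Field.Text("url", args[i]));
            doc.add(Field.Text("contents", stripHTML(page.toString())));
            writer.addDocument(doc);
        }

        writer.optimize();
        writer.close();
    }
}

Usage would be along the lines of "java UrlIndexer http://example.org/page.html",
with the resulting index searchable through the normal IndexSearcher/QueryParser
classes.  For anything beyond a quick filter, the crawler and HTML-parsing code
in the Lucene sandbox mentioned above is the better starting point.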