From: Otis Gospodnetic
Date: Fri, 19 Apr 2002 10:50:37 -0700 (PDT)
Subject: RE: HTML parser
To: Lucene Users List, ian@plusfour.org

Such classes are not included with Lucene.  This was _just_ mentioned on
this list earlier today.  Look at the archives and search for crawler,
URL, Lucene sandbox, etc.

Otis

--- Ian Forsyth wrote:
>
> Are there core classes in Lucene that allow one to feed Lucene links,
> so that 'it' will capture the contents of those URLs into the index?
>
> Or does one write a file-capture class to fetch each URL, store the
> file in a directory, and then index the local directory?
>
> Ian
>
>
> -----Original Message-----
> From: Terence Parr [mailto:parrt@jguru.com]
> Sent: Friday, April 19, 2002 1:38 AM
> To: Lucene Users List
> Subject: Re: HTML parser
>
>
> On Thursday, April 18, 2002, at 10:28 PM, Otis Gospodnetic wrote:
>
> :snip
>
> Hi Otis,
>
> I have an HTML parser built for ANTLR, but it's pretty strict in what
> it accepts.  Not sure how useful it will be for you, but here it is:
>
> http://www.antlr.org/grammars/HTML
>
> I am not sure what your goal is, but I personally have to scarf all
> sorts of HTML from various websites to suck them into the jGuru search
> engine.  I use a simple stripHTML() method I wrote to handle it.  Works
> great.  Kills everything but the text.  Is that the kind of thing you
> are looking for, or do you really want to parse, not filter?
>
> Terence
> --
> Co-founder, http://www.jguru.com
> Creator, ANTLR Parser Generator: http://www.antlr.org
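
[Editor's note: neither Ian's URL-capture class nor Terence's stripHTML() method
appears in this thread.  The sketch below is only an assumption of what such a
helper might have looked like against the Lucene 1.x API of the time: fetch each
page over HTTP, strip the tags with a crude regex (a filter, not a real parser),
and add the remaining text to an index.  The class name UrlIndexer and the
stripHTML() body are illustrative, not part of Lucene or either poster's code.]

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

/** Illustrative sketch only: fetch a few URLs, strip tags, index the text. */
public class UrlIndexer {

    /** Crude tag filter in the spirit of a stripHTML() method: removes
     *  everything between '<' and '>' but makes no attempt to parse. */
    static String stripHTML(String html) {
        return html.replaceAll("<[^>]*>", " ");
    }

    public static void main(String[] args) throws Exception {
        // true = create a fresh index in the "index" directory
        IndexWriter writer =
            new IndexWriter("index", new StandardAnalyzer(), true);

        for (int i = 0; i < args.length; i++) {
            // Fetch the page contents over HTTP.
            URL url = new URL(args[i]);
            BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream()));
            StringBuffer page = new StringBuffer();
            String line;
            while ((line = in.readLine()) != null) {
                page.append(line).append('\n');
            }
            in.close();

            // Index the stripped text; Field.Text is stored, indexed, tokenized.
            Document doc = new Document();
            doc.add(Field.Text("url", args[i]));
            doc.add(Field.Text("contents", stripHTML(page.toString())));
            writer.addDocument(doc);
        }

        writer.optimize();
        writer.close();
    }
}

Usage would be along the lines of "java UrlIndexer http://example.org/page.html",
with the resulting index searchable through the normal IndexSearcher/QueryParser
classes.  For anything beyond a quick filter, the crawler and HTML-parsing code
in the Lucene sandbox mentioned above is the better starting point.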