From: "Erik Hatcher"
To: "Lucene Users List"
Subject: Re: HTML parser
Date: Fri, 19 Apr 2002 20:32:45 -0400
Message-ID: <00fa01c1e802$e4e53610$6501a8c0@darden.virginia.edu>
References: <1B4E2CBB-53DC-11D6-9587-0003934469CE@apple.com>

HttpUnit (which uses JTidy under the covers) makes child's play of
pulling out links and navigating to them. The only caveat (and this would
be true for practically all tools, I suspect) is that the HTML has to be
relatively well-formed for it to work well. JTidy can be somewhat
forgiving, though.

    Erik

----- Original Message -----
From: "David Black"
To: "Lucene Users List"
Sent: Friday, April 19, 2002 5:26 PM
Subject: Re: HTML parser

> While trying to research the same thing, I found the following...here's
> a good example of link extraction.....
>
> http://developer.java.sun.com/developer/TechTips/1999/tt0923.html
>
> It seems like I could use this to also get the text out from between the
> tags but haven't been able to do it yet. It seems like it should be
> simple but geez...my head hurts.
>
> On Friday, April 19, 2002, at 01:40 PM, Ian Forsyth wrote:
>
> > Are there core classes part of lucene that allow one to feed lucene
> > links, and 'it' will capture the contents of those urls into the index..
> >
> > or does one write a file capture class to seek out the url store the
> > file in a directory, then index the local directory..
> >
> > Ian
> >
> > -----Original Message-----
> > From: Terence Parr [mailto:parrt@jguru.com]
> > Sent: Friday, April 19, 2002 1:38 AM
> > To: Lucene Users List
> > Subject: Re: HTML parser
> >
> > On Thursday, April 18, 2002, at 10:28 PM, Otis Gospodnetic wrote:
> >
> > :snip
> >
> > Hi Otis,
> >
> > I have an HTML parser built for ANTLR, but it's pretty strict in what it
> > accepts. Not sure how useful it will be for you, but here it is:
> >
> > http://www.antlr.org/grammars/HTML
> >
> > I am not sure what your goal is, but I personally have to scarf all
> > sorts of HTML from various websites to suck them into the jGuru search
> > engine. I use a simple stripHTML() method I wrote to handle it. Works
> > great. Kills everything but the text. Is that the kind of thing you
> > are looking for or do you really want to parse not filter?
> >
> > Terence
> > --
> > Co-founder, http://www.jguru.com
> > Creator, ANTLR Parser Generator: http://www.antlr.org

--
To unsubscribe, e-mail:
For additional commands, e-mail:
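
For anyone following the thread who wants both the links and the text "from between the tags", here is a minimal sketch that uses only the HTML parser bundled with the JDK (javax.swing.text.html). It is not Terence's stripHTML() method, which was never posted, and it is not necessarily the approach taken in the linked TechTip. The class name HtmlTextAndLinks, the Collector helper, and the sample page are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical example class; not part of Lucene, HttpUnit, or JTidy.
public class HtmlTextAndLinks {

    /** Collects href values and text chunks as the parser reports them. */
    static final class Collector extends HTMLEditorKit.ParserCallback {
        final List<String> links = new ArrayList<>();
        final StringBuilder text = new StringBuilder();

        @Override
        public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
            // Record the target of every <a href="..."> element.
            if (tag == HTML.Tag.A) {
                Object href = attrs.getAttribute(HTML.Attribute.HREF);
                if (href != null) {
                    links.add(href.toString());
                }
            }
        }

        @Override
        public void handleText(char[] data, int pos) {
            // Each run of character data arrives here with the markup removed;
            // separate runs with a single space.
            if (text.length() > 0) {
                text.append(' ');
            }
            text.append(data);
        }
    }

    /** Parses an HTML string and returns the collected links and text. */
    static Collector parse(String html) {
        Collector collector = new Collector();
        try {
            // The third argument tells the parser to ignore charset directives.
            new ParserDelegator().parse(new StringReader(html), collector, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return collector;
    }

    public static void main(String[] args) {
        String page = "<html><body><p>Hello "
                + "<a href=\"http://www.antlr.org\">ANTLR</a> world</p></body></html>";
        Collector result = parse(page);
        System.out.println("Links: " + result.links);
        System.out.println("Text:  " + result.text);
    }
}
```

The collected text could then be handed to Lucene as the contents of a document field. Fetching the page from a URL and following the extracted links would still have to be written separately, since this sketch only parses a string already in memory.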