lucene-java-user mailing list archives

From "Erik Hatcher" <li...@ehatchersolutions.com>
Subject Re: HTML parser
Date Sat, 20 Apr 2002 00:32:45 GMT
HttpUnit (which uses JTidy under the covers) makes child's play out of
pulling out links and navigating to them.

The only caveat (and this would be true for practically all tools, I
suspect) is that the HTML has to be relatively well-formed for it to work
well.  JTidy can be somewhat forgiving though.

    Erik
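
For readers without HttpUnit on hand, the same link-pulling idea can be
sketched with the HTML parser that ships in the JDK (javax.swing.text.html).
This is an illustration under assumptions, not HttpUnit's API: the class name
LinkExtractor and the method extractLinks are made up for this example, and
the Swing parser is swapped in for JTidy, so its tolerance for badly formed
HTML may differ.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: collects href values from <a> tags using the JDK parser.
final class LinkExtractor {

    /** Returns the href value of every <a> tag that has one, in document order. */
    static List<String> extractLinks(Reader html) throws IOException {
        final List<String> links = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        links.add(href.toString());
                    }
                }
            }
        };
        // The third argument tells the parser to ignore any charset declared in the page.
        new ParserDelegator().parse(html, callback, true);
        return links;
    }

    public static void main(String[] args) throws IOException {
        String page = "<html><body><p>See <a href=\"http://www.antlr.org/\">ANTLR</a> and "
                + "<a href=\"http://www.jguru.com/\">jGuru</a>.</p></body></html>";
        for (String link : extractLinks(new StringReader(page))) {
            System.out.println(link);
        }
    }
}
```

Navigating to each link (fetching it and repeating the extraction) is left
out; relative hrefs would first need resolving against the page's URL.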

----- Original Message -----
From: "David Black" <black@apple.com>
To: "Lucene Users List" <lucene-user@jakarta.apache.org>
Sent: Friday, April 19, 2002 5:26 PM
Subject: Re: HTML parser


> While trying to research the same thing, I found the following...here's
> a good example of link extraction.....
>
> http://developer.java.sun.com/developer/TechTips/1999/tt0923.html
>
> It seems like I could use this to also get the text out from between the
> tags but haven't been able to do it yet.  It seems like it should be
> simple but geez...my head hurts.
>
> On Friday, April 19, 2002, at 01:40 PM, Ian Forsyth wrote:
>
> >
> > Are there core classes in Lucene that let one feed it links, so that
> > it captures the contents of those URLs into the index?
> >
> > Or does one write a file-capture class that fetches each URL, stores
> > the file in a directory, and then indexes the local directory?
> >
> > Ian
> >
> >
> > -----Original Message-----
> > From: Terence Parr [mailto:parrt@jguru.com]
> > Sent: Friday, April 19, 2002 1:38 AM
> > To: Lucene Users List
> > Subject: Re: HTML parser
> >
> >
> >
> > On Thursday, April 18, 2002, at 10:28  PM, Otis Gospodnetic wrote:
> >
> > :snip
> >
> > Hi Otis,
> >
> > I have an HTML parser built for ANTLR, but it's pretty strict in what it
> > accepts.  Not sure how useful it will be for you, but here it is:
> >
> > http://www.antlr.org/grammars/HTML
> >
> > I am not sure what your goal is, but I personally have to scarf all
> > sorts of HTML from various websites to suck them into the jGuru search
> > engine.  I use a simple stripHTML() method I wrote to handle it.  Works
> > great.  Kills everything but the text.  Is that the kind of thing you
> > are looking for, or do you really want to parse, not filter?
> >
> > Terence
> > --
> > Co-founder, http://www.jguru.com
> > Creator, ANTLR Parser Generator: http://www.antlr.org
> >
> >
> > --
> > To unsubscribe, e-mail:
> > <mailto:lucene-user-unsubscribe@jakarta.apache.org>
> > For additional commands, e-mail:
> > <mailto:lucene-user-help@jakarta.apache.org>
> >
> >
> >
> >
>
>
>
>
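
On getting the text out from between the tags, as asked in the quoted
messages: one possible approach is a parser callback that keeps only the
character data and ignores every tag. The sketch below uses the JDK's Swing
HTML parser. It is not Terence's stripHTML(), whose source does not appear in
this thread; the names HtmlTextExtractor and extractText are invented for the
example, and the exact whitespace the parser hands back is not guaranteed
here, so the result is normalized to single spaces.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: keeps the text of an HTML document and drops the markup.
final class HtmlTextExtractor {

    /** Returns the document's text chunks joined by single spaces. */
    static String extractText(Reader html) throws IOException {
        final StringBuilder text = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Called once per run of character data found between tags.
                text.append(data).append(' ');
            }
        };
        new ParserDelegator().parse(html, callback, true);
        // Collapse runs of whitespace so the output is easy to tokenize or index.
        return text.toString().replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) throws IOException {
        String page = "<html><body><h1>HTML parser</h1>"
                + "<p>Kills <b>everything</b> but the text.</p></body></html>";
        System.out.println(extractText(new StringReader(page)));
    }
}
```

The returned string could then be handed to whatever indexing code is in use.
Because each text chunk is followed by a space, words split by inline tags
(for example "un<b>believ</b>able") will come out as separate tokens; joining
chunks without a separator is the alternative trade-off.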



