lucene-java-user mailing list archives

From "" <>
Subject Re: HTML parser
Date Sat, 20 Apr 2002 13:29:17 GMT
Hi all,

I'm very interested in this thread. I also have to solve the problem 
of spidering web sites: creating the index (well, here there is the 
BIG problem that Lucene can't be integrated easily with a DB), 
extracting the links from each page, and repeating the whole process.
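
To make the loop concrete, here is a rough sketch of what I have in mind, not a finished spider. The class name Spider, the fetchAndExtract hook and the maxPages limit are all names I made up for illustration; the hook is where fetching the page, adding it to the index and pulling out its links would happen. It only follows links on the same host as the start page.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

class Spider {
    // Hypothetical hook: fetch the page, index it, return the links found on it.
    private final Function<URI, List<URI>> fetchAndExtract;
    // Safety limit so the crawl always terminates.
    private final int maxPages;

    Spider(Function<URI, List<URI>> fetchAndExtract, int maxPages) {
        this.fetchAndExtract = fetchAndExtract;
        this.maxPages = maxPages;
    }

    // Breadth-first crawl from 'start'; returns the pages in the order visited.
    List<URI> crawl(URI start) {
        Set<URI> seen = new HashSet<>();      // every URL ever queued
        Deque<URI> queue = new ArrayDeque<>(); // URLs waiting to be fetched
        List<URI> visited = new ArrayList<>();
        seen.add(start);
        queue.add(start);
        while (!queue.isEmpty() && visited.size() < maxPages) {
            URI page = queue.remove();
            visited.add(page);
            for (URI link : fetchAndExtract.apply(page)) {
                boolean sameHost = start.getHost() != null
                        && start.getHost().equalsIgnoreCase(link.getHost());
                // seen.add returns false for a URL that was already queued,
                // which is what stops the spider from looping forever.
                if (sameHost && seen.add(link)) {
                    queue.add(link);
                }
            }
        }
        return visited;
    }
}
```

A queue is used instead of real recursion so that a deep site cannot overflow the stack. A real spider would probably also want to strip #fragments from URLs and respect robots.txt, which this sketch does not do.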

For extracting links from a page I'm thinking of using JTidy. I think 
that with this library you can also parse a page that is not well formed 
(one you fetch from the web with URLConnection) by setting the property 
that cleans up the page. The Tidy class returns an org.w3c.dom.Document 
that you can use to analyze the whole document: for example, you can 
call doc.getElementsByTagName("a") to get all the a elements. You can 
parse it as XML.
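
A small sketch of the DOM side of this, using only the JDK so that it runs without JTidy. The names LinkExtractor, extractLinks and parseXhtml are mine, and parseXhtml only handles input that is already well formed; with JTidy I would expect the Document to come from the Tidy class instead (I believe the call is parseDOM on an input stream, but check the JTidy javadoc).

```java
import java.io.StringReader;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class LinkExtractor {

    // Returns the absolute URL of every <a href="..."> in an already-parsed DOM.
    static List<String> extractLinks(Document doc, URI base) {
        List<String> links = new ArrayList<>();
        // Tag names are case-sensitive in an XML DOM, so this matches "a" but not "A".
        NodeList anchors = doc.getElementsByTagName("a");
        for (int i = 0; i < anchors.getLength(); i++) {
            Element a = (Element) anchors.item(i);
            String href = a.getAttribute("href"); // "" when the attribute is absent
            if (href.isEmpty()) {
                continue; // e.g. <a name="top">, not a link
            }
            try {
                // Resolve relative links against the URL of the page itself.
                links.add(base.resolve(href).toString());
            } catch (IllegalArgumentException e) {
                // href is not a valid URI; skip it
            }
        }
        return links;
    }

    // Stand-in for the JTidy step (assumption): parses well-formed XHTML only.
    static Document parseXhtml(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed XHTML", e);
        }
    }
}
```

Calling extractLinks with the page's own URL as base turns a relative href such as /docs/index.html into a full URL that can be fetched in the next round.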

Has anyone solved the problem of recursively spidering web pages?


> >While trying to research the same thing, I found the
's a 
> >good example of link extraction.....
> Try
> It's easy to write a Visitor which extracts the links; should take about ten 
> lines of code.
> --
> Brian Goetz
> Quiotix Corporation
>           Tel: 650-843-1300            Fax: 650-324-