lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "lucene@libero.it" <luc...@libero.it>
Subject Re: HTML parser
Date Sat, 20 Apr 2002 13:29:17 GMT
Hi all,

I'm very interested about this thread. I also have to solve the problem 
of spidering web sites, creating index (weel about this there is the 
BIG problem that lucene can't be integrated easily with a DB), 
extracting links from the page repeating all the process.

For extracting links from a page I'm thinking to use JTidy. I think 
that with this library you can also parse a non well formed page (that 
you can take from the web with URLConnection) setting the property to 
clean the page. The class Tidy() returns a org.w3c.dom.Document that 
you can use for analizing all the document: for example you can use 
doc.getElementsByTagName(a) for taking all the a elements. You can 
parse as xml.

Did someone solve the problem to spider recursively a web pages?

Laura




> 
> >While trying to research the same thing, I found the following...here
's a 
> >good example of link extraction.....
> 
> Try http://www.quiotix.com/opensource/html-parser
> 
> Its easy to write a Visitor which extracts the links; should take abou
t ten 
> lines of code.
> 
> 
> 
> --
> Brian Goetz
> Quiotix Corporation
> brian@quiotix.com           Tel: 650-843-1300            Fax: 650-324-
8032
> 
> http://www.quiotix.com
> 
> 
> --
> To unsubscribe, e-mail:   <mailto:lucene-user-
unsubscribe@jakarta.apache.org>
> For additional commands, e-mail: <mailto:lucene-user-
help@jakarta.apache.org>
> 
> 
Mime
View raw message