incubator-general mailing list archives

From <kozlov...@sipo.gess.ethz.ch>
Subject RE: Mentor(s) required for a search engine project
Date Tue, 16 Nov 2004 16:04:53 GMT
Andreas,

Items 10 and 11 of the Lucene FAQ provide a (partial) answer.

--------------------------------------------------------------------------------

10. Can I use Lucene to crawl my site or other sites on the Internet?
No. Lucene does not know how to access external documents, nor does it know how to extract
the content and links of HTML and other document formats. Lucene focuses on indexing and
searching, and does both well.

--------------------------------------------------------------------------------

11. How can I extract the content of HTML pages?
Lucene (at least the current version) does not provide handlers for various document formats
and leaves this task to the application. To extract content from HTML pages, you may use an
HTML parser (there are several free versions on the Internet). If you have a hard time finding
one, you can post a question to the Lucene User mailing list.

(tip by T.J.Mather) Lucene includes an HTML parser in the demo/HTMLParser directory of the
distribution. This is used by the demo/IndexHTML.java class. 
--------------------------------------------------------------------------------
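
To illustrate what "leaving this task to the application" in item 11 can look like, here is a minimal sketch that pulls the plain text and the link targets out of an HTML string. It assumes the HTML parser that ships with the JDK (javax.swing.text.html), not the parser from Lucene's demo directory; the class name HtmlTextExtractor and its Page holder are hypothetical names chosen for this example. The extracted text is what an application would then hand to Lucene for indexing, and the links are what a separate crawler would follow.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper, not part of Lucene: extracts text and links from HTML
// using the HTML parser bundled with the JDK.
public class HtmlTextExtractor {

    /** Plain text and outgoing link targets of one page. */
    public static final class Page {
        public final String text;
        public final List<String> links;

        Page(String text, List<String> links) {
            this.text = text;
            this.links = links;
        }
    }

    public static Page extract(String html) {
        final StringBuilder text = new StringBuilder();
        final List<String> links = new ArrayList<>();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Separate chunks with a space so words from adjacent elements do not fuse.
                text.append(data).append(' ');
            }

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                // Record the href of every anchor element that has one.
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        links.add(href.toString());
                    }
                }
            }
        };

        try (Reader reader = new StringReader(html)) {
            // 'true' tells the parser to ignore any charset declared inside the document.
            new ParserDelegator().parse(reader, callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }

        // Collapse runs of whitespace into single spaces.
        String normalized = text.toString().replaceAll("\\s+", " ").trim();
        return new Page(normalized, links);
    }

    public static void main(String[] args) {
        Page page = extract(
            "<html><body><p>Hello <a href=\"http://example.org/a\">world</a></p></body></html>");
        System.out.println(page.text);
        System.out.println(page.links);
    }
}
```

This only covers content extraction; fetching pages over the network and deciding which links to follow would still be the job of a separate crawler, as item 10 points out.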

Gregory

-----Original Message-----
From: Andreas Kuckartz [mailto:A.Kuckartz@ping.de]
Sent: Tuesday, 16 November 2004 16:37
To: general@incubator.apache.org
Subject: Re: Mentor(s) required for a search engine project


I am no potential sponsor but would like to see a comparison to Apache Jakarta
Lucene (http://jakarta.apache.org/lucene/docs/index.html) which is implemented
in Java.

Andreas


