incubator-general mailing list archives

From Nick Lothian <nick.loth...@essential.com.au>
Subject RE: Mentor(s) required for a search engine project
Date Wed, 17 Nov 2004 22:40:36 GMT
Are you familiar with Nutch (http://www.nutch.org/)?

Nick

> 
> Andreas,
> 
> The items 10 and 11 from the Lucene FAQ provide the (partial) answer.
> 
> --------------------------------------------------------------------------------
> 
> 10. Can I use Lucene to crawl my site or other sites on the Internet ?
> No. Lucene does not know how to access external documents, nor 
> does it know how to extract the content and links of HTML and 
> other document formats. Lucene focuses on indexing and 
> searching, and does that very well. 
> 
> --------------------------------------------------------------------------------
> 
> 11. How can I extract the content of HTML pages ?
> Lucene (at least the current version) does not provide 
> handlers for various document formats and leaves this task to 
> the application. To extract content from HTML pages, you may 
> use an HTML parser (there are several free ones on the 
> Internet). If you have a hard time finding one, you can post a 
> question to the Lucene User mailing list. 
> 
> (tip by T.J.Mather) Lucene includes an HTML parser in the 
> demo/HTMLParser directory of the distribution. This is used 
> by the demo/IndexHTML.java class. 
> --------------------------------------------------------------------------------
> 
> Gregory
> 
> -----Original Message-----
> From: Andreas Kuckartz [mailto:A.Kuckartz@ping.de]
> Sent: Tuesday, 16 November 2004 16:37
> To: general@incubator.apache.org
> Subject: Re: Mentor(s) required for a search engine project
> 
> 
> I am not a potential sponsor, but I would like to see a comparison 
> to Apache Jakarta Lucene 
> (http://jakarta.apache.org/lucene/docs/index.html), which is 
> implemented in Java.
> 
> Andreas
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
> For additional commands, e-mail: general-help@incubator.apache.org
> 
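
To illustrate FAQ item 11 quoted above (the application, not Lucene, has to pull text and links out of HTML before anything is indexed), here is a minimal sketch of one way that step might look. It uses the HTML parser that ships with the JDK (javax.swing.text.html) rather than the demo/HTMLParser from the Lucene distribution. The class name HtmlTextExtractor and its fields are hypothetical, invented for this example only; they are not part of Lucene or Nutch.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper (not a Lucene or Nutch class): collects the text
// and the <a href> targets that the JDK's HTML parser reports.
final class HtmlTextExtractor {
    final StringBuilder text = new StringBuilder();
    final List<String> links = new ArrayList<>();

    static HtmlTextExtractor extract(String html) throws IOException {
        HtmlTextExtractor result = new HtmlTextExtractor();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Append each text run, separated by a space.
                result.text.append(data).append(' ');
            }

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                // Record the href attribute of every anchor tag, if present.
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        result.links.add(href.toString());
                    }
                }
            }
        };

        // 'true' tells the parser to ignore any charset declared in the page,
        // since the input is already a decoded String.
        new ParserDelegator().parse(new StringReader(html), callback, true);
        return result;
    }

    public static void main(String[] args) throws IOException {
        HtmlTextExtractor r = extract(
            "<html><body><p>See <a href=\"http://www.nutch.org/\">Nutch</a></p></body></html>");
        System.out.println("text:  " + r.text);
        System.out.println("links: " + r.links);
    }
}
```

The extracted text is what an application would then hand to Lucene for indexing, and the collected links are what a crawler such as Nutch would follow. Fetching the pages themselves is a separate step that this sketch does not cover.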
