lucene-java-user mailing list archives

From "bbrown" <bbr...@botspiritcompany.com>
Subject Lucene or nutch for indexing web documents
Date Tue, 27 Nov 2007 23:13:06 GMT
I am considering not using Nutch for indexing web documents.  Instead, I would
either index the full HTML document, or use some kind of web scraper or HTML
parser utility to extract only the text content from a web page and then
index that.
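
As a rough sketch of the text-extraction option, assuming the HTML parser that
ships with the JDK (javax.swing.text.html) is good enough for the pages in
question: the class name HtmlTextExtractor and the method extractText are
made up for illustration, and a dedicated HTML parsing library may cope better
with messy real-world markup.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper (not part of Lucene or Nutch): strips the markup from
// an HTML page and keeps only the text runs reported by the JDK's parser.
class HtmlTextExtractor {

    static String extractText(String html) {
        final StringBuilder text = new StringBuilder();

        // The parser calls handleText once for each run of character data.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                text.append(data).append(' ');
            }
        };

        try {
            // 'true' tells the parser to ignore any charset declared in the page,
            // since the input is already a decoded String.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }

        // Collapse runs of whitespace so the result is one clean line of text.
        return text.toString().replaceAll("\\s+", " ").trim();
    }
}
```

The returned string is what would then be handed to Lucene as the content of
a field, alongside whatever other fields (URL, title, and so on) are wanted.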

I know it is strange, but I feel I have more control over what gets indexed if
I use just Lucene.  E.g., I can add more fields, and I can guarantee that I
will be able to search whatever gets indexed.

Is this a bad approach, or should I just use Nutch?

--
Berlin Brown
[berlin dot brown at gmail dot com]
http://botspiritcompany.com/botlist/?



