lucene-java-user mailing list archives

From "bbrown"
Subject Lucene or Nutch for indexing web documents
Date Tue, 27 Nov 2007 23:13:06 GMT
I am considering not using Nutch for indexing web documents.  Instead, I was
thinking of either extracting the full HTML document or using some kind of
web scraper or HTML parser utility to extract only the text content from a
web page, and then indexing that.
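
To make the second option concrete, here is a minimal sketch of the "extract
only the text content" step, written in Python with only the standard library.
It is an illustration under assumptions, not Nutch's or Lucene's own parsing:
the names TextExtractor and html_to_fields and the field names url, title, and
content are hypothetical, chosen only to show what could be handed to an
indexer afterwards.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the <title> and visible text, skipping script/style content.

    Hypothetical helper for illustration; not part of Lucene or Nutch.
    """

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.title_parts = []
        self.body_parts = []
        self._skip_depth = 0      # > 0 while inside <script> or <style>
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._skip_depth:
            return
        text = " ".join(data.split())   # collapse runs of whitespace
        if not text:
            return
        if self._in_title:
            self.title_parts.append(text)
        else:
            self.body_parts.append(text)


def html_to_fields(url, html):
    """Return a dict of field name -> text for one page (assumed field names)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return {
        "url": url,
        "title": " ".join(parser.title_parts),
        "content": " ".join(parser.body_parts),
    }
```

Each key of the returned dict would map to one field of an indexed document,
which is where the extra control comes from: adding a field means adding a key
here and indexing it alongside the others.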

I know it is strange, but I feel I have more control over what gets indexed
if I use just Lucene.  E.g., I can add more fields, and I can guarantee that
I will be able to search whatever gets indexed.

Is this a bad approach, or should I just use Nutch?

Berlin Brown
[berlin dot brown at gmail dot com]
