lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Grant Ingersoll <gsing...@apache.org>
Subject Re: Lucene or nutch for indexing web documents
Date Wed, 28 Nov 2007 11:47:51 GMT
Seems reasonable to me, but I guess I wonder what kind of control you  
have that you don't in Nutch?  Maybe worth asking on Nutch.  Also, it  
is fairly easy in Nutch to separate the crawling aspect from the  
indexing aspect, such that you could use all of Nutch's power for  
crawling and extracting content, and then index in Lucene or Solr on  
your own.


On Nov 27, 2007, at 6:13 PM, bbrown wrote:

> I was considering not using nutch for indexing web documents.  I was  
> thinking
> either extracting the full HTML document or through the use of some  
> kind of
> web scraper html parser utility extracting only the text content  
> from a web
> page and then indexing that.
>
> I know it is strange, but I feel I have more control on what gets  
> indexed if I
> use just Lucene.  Eg, I can add more fields and also I guarantee I  
> will be
> able to search what gets indexed.
>
> Is this a bad approach or should I just use nutch?
>
> --
> Berlin Brown
> [berlin dot brown at gmail dot com]
> http://botspiritcompany.com/botlist/?
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>

--------------------------
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ




---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message