nutch-user mailing list archives

From Ian Reardon
Subject Some Nutch Questions
Date Wed, 04 May 2005 17:15:10 GMT
I would like to build a search engine based on a handful of hand-picked
sites from a specific domain. I have a few questions.

How many documents can a single server (dual-CPU Xeon) hold while still
giving respectable search performance? Assume disk space is not a
constraint; an approximate figure is fine.

My idea is to pick a handful of sites that I have judged for quality and
re-index them on a regular schedule, maybe once a month. I would also like
to add new sites over time. Does this sound feasible with Nutch?

What method would be best for this type of application? I set up Nutch
and crawled a very small sample using method 1 in the tutorial,
"Intranet crawl", but I was unable to get the whole-web crawl to work. What
does the -dmozfile flag do? I don't want to base this on DMOZ. If
anyone could point me to documentation or a tutorial that better
explains whole-web crawling, I would appreciate it. Thanks a lot.
