nutch-dev mailing list archives

From Alex <>
Subject Nutch Crawl Vs. Merge Time Complexity
Date Fri, 03 Mar 2006 21:24:17 GMT
Hi there,

I have a couple of questions that I need help with.

I'm fairly new to the nutch-dev mailing list, and I'm not quite sure what the appropriate
way is to get involved with the Nutch development group. Please let me know whom I should
contact regarding issues and questions about Nutch.

I've been using Nutch and customizing it so that the returned search results can be managed
with paging on the web. I'm doing this for my company, and my supervisor has agreed to
contribute the paging code to the Nutch community. Please guide me on how to proceed
with this.

Finally, a technical question. I've been using Nutch v0.7 on our company's Unix system,
where it is set up to crawl our intranet sites daily for updates. I've tried using merge,
dedup, updatedb, etc., and I noticed that this was slower and less efficient than doing a
fresh crawl. For example, if I have two separate crawls of two different domains, such as
hotmail and yahoo, how would the time for Nutch to crawl these two domains and then do a
merge compare with the time for a single full crawl of both domains? My guess is that it
will take Nutch about the same amount of time either way. If that is so, is there a reason
to use merge at all? Please let me know what you think. I'm still trying to understand how
Nutch behaves and don't mean to criticize anyone who has worked on the merge feature.
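
To make the comparison concrete, here is a rough toy cost model of the two scenarios.
It is only a sketch: the assumption that cost grows linearly with page count, the
per-page timings, and every function name are hypothetical and were not measured on or
taken from Nutch.

```python
# Toy cost model for the question above. All names and numbers are hypothetical
# illustrations, not Nutch APIs or measured Nutch timings.

FETCH_MS_PER_PAGE = 1000  # assumed cost to fetch and index one page
MERGE_MS_PER_PAGE = 100   # assumed cost to merge/dedup one page's index entry


def fetch_cost(pages):
    """Assumed time (ms) to crawl `pages` pages, linear in the page count."""
    return pages * FETCH_MS_PER_PAGE


def merge_cost(total_pages):
    """Assumed time (ms) to merge indexes holding `total_pages` pages in total."""
    return total_pages * MERGE_MS_PER_PAGE


def separate_then_merge(pages_a, pages_b):
    """Crawl each domain on its own, then merge the two results."""
    return fetch_cost(pages_a) + fetch_cost(pages_b) + merge_cost(pages_a + pages_b)


def single_full_crawl(pages_a, pages_b):
    """Crawl both domains together in one pass, with no merge step."""
    return fetch_cost(pages_a + pages_b)


if __name__ == "__main__":
    a, b = 1000, 1000  # hypothetical page counts for the two domains
    print("separate crawls + merge:", separate_then_merge(a, b), "ms")
    print("single full crawl:      ", single_full_crawl(a, b), "ms")
```

Under these assumptions the two approaches differ only by the merge term, which is the
part I would like to understand better for real Nutch runs.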


