nutch-user mailing list archives

From Jason Manfield <>
Subject Re: [Nutch-general] using nutch just for crawling, not indexing?
Date Mon, 02 May 2005 22:03:57 GMT
Thanks for the pointer.
I suppose the Fetcher is the core piece that reads content from the URLs and dumps it to
different directories in the filesystem (via Fetcher.outputPage), right? In that case, can
this be intercepted (via my own local code changes) to dump the extracted content into our
proprietary system? Are the segments created as part of the Fetcher, or before the call to
the Fetcher?
Jason wrote:
Jason - this is perfectly doable -- I do this for my social bookmarking

I think people tend to run Nutch using the nutch shell script that
comes with Nutch, but you can call the Fetcher Java class directly
and programmatically yourself, as it has a main method. You can do
the same with the SegmentMergeTool. So, if you can write a Java
app, just call Nutch's Java classes the same way that the shell script does.
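
To illustrate calling a tool's main method from your own Java app, here is a minimal sketch, not Nutch code. MainInvoker and EchoTool are hypothetical names: EchoTool is only a stand-in so the example runs without Nutch on the classpath. The fully qualified name of the Fetcher class and the arguments it expects depend on your Nutch version, so both are left as parameters; check the nutch shell script to see what it passes.

```java
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;

public class MainInvoker {

    /** Stand-in for a real tool class (hypothetical); it only records its arguments. */
    public static class EchoTool {
        public static String[] lastArgs = new String[0];

        public static void main(String[] args) {
            lastArgs = args.clone();
        }
    }

    /**
     * Loads the named class and calls its public static main(String[])
     * inside the current JVM, much as a launcher script would do
     * by starting "java <className> <args>".
     */
    public static void invokeMain(String className, String... args) {
        try {
            Class<?> cls = Class.forName(className);
            Method main = cls.getMethod("main", String[].class);
            // Cast to Object so the array is passed as one argument, not spread as varargs.
            main.invoke(null, (Object) args);
        } catch (InvocationTargetException e) {
            // The tool's own main threw; surface the original cause.
            throw new IllegalStateException(className + ".main failed", e.getCause());
        } catch (ReflectiveOperationException e) {
            // Class not found, no main method, or main not accessible.
            throw new IllegalStateException("cannot invoke " + className + ".main", e);
        }
    }

    public static void main(String[] argv) {
        // Placeholder arguments; a real tool defines its own.
        invokeMain(EchoTool.class.getName(), "-example", "arg");
        System.out.println(String.join(" ", EchoTool.lastArgs));
    }
}
```

With the Nutch jars on the classpath you could skip reflection and call the class's main(args) directly; the reflective form just avoids a compile-time dependency. Keep in mind that if a tool's main calls System.exit, it would end your application as well.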

I can't help you with reading Nutch's files with C#, but the source is
there, so you should be able to write file readers in C#.

Simpy -- tags, social bookmarks, personal search engine

--- Jason Manfield wrote:
> We would like to use nutch just for crawling, and then index the
> crawled database into our proprietary datastore/index. How do we go
> about this? I see that nutch is a shell script, so it is possible to
> just crawl. Once it crawls, I suppose the crawled data is dumped into
> webdb. Are there exposed APIs to extract the data from webdb? 
> One more catch -- our company is a .NET shop :((, so we would like to
> use C# to read the data of the fetched/crawled pages for further
> indexing.
> Ideas/suggestions?
> Any plans to have nutch for .NET (like dotLucene)?
