nutch-user mailing list archives

From <ogjunk-nu...@yahoo.com>
Subject Re: [Nutch-general] using nutch just for crawling, not indexing?
Date Tue, 03 May 2005 18:44:40 GMT
Hi Jason,

That looks correct: Fetcher.outputPage(...) writes FetcherOutput to
disk via an ArrayFile.Writer instance.

Otis
____________________________________________________________________
Simpy -- simpy.com -- tags, social bookmarks, personal search engine


--- Jason Manfield <rarish911@yahoo.com> wrote:
> Otis
>  
> Thanks for the pointer.
>  
> I suppose Fetcher.java is the core class that reads contents from the
> URLs and dumps them to different directories in the filesystem (via
> Fetcher.outputPage), right? In that case, can this be intercepted
> (via my code changes locally) to dump the extracted contents into our
> proprietary system? Are the segments created as part of the Fetcher
> or before the call to the Fetcher?
>  
> Thanks
>  
> Jason
> 
> 
> ogjunk-nutch@yahoo.com wrote:
> Jason - this is perfectly doable -- I do this for my social bookmarking
> project, Simpy.com
> 
> I think people tend to run Nutch using the nutch shell script that
> comes with Nutch, but you can really call the Fetcher Java class
> directly and programmatically yourself, as it has the main method. You
> can do the same with the SegmentMergeTool. So, if you can write a Java
> app, just call Nutch's Java classes the same way that the shell script
> does.
> 
> I can't help you with reading Nutch's files with C#, but the source is
> there, so you should be able to write file readers in C#.
> 
> Otis
> ____________________________________________________________________
> Simpy -- simpy.com -- tags, social bookmarks, personal search engine
> 
> 
> 
> --- Jason Manfield wrote:
> > We would like to use nutch just for crawling, and then index the
> > crawled database into our proprietary datastore/index. How do we go
> > about this? I see that nutch is a shell script, so it is possible to
> > just crawl. Once it crawls, I suppose the crawled data is dumped into
> > webdb. Are there exposed APIs to extract the data from webdb?
> > 
> > One more catch -- our company is a .NET shop :((, so we would like to
> > use C# to read the data of the fetched/crawled pages for further
> > indexing.
> > 
> > Ideas/suggestions?
> > 
> > Any plans to have nutch for .NET (like dotLucene)?
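
For anyone trying the approach discussed in this thread (calling a tool
class such as the Fetcher from your own Java app instead of going through
the shell script), here is a minimal sketch of the general technique. It
is not Nutch code. MainInvoker, invokeMain and DemoTool are hypothetical
names made up for illustration, and "segments/demo" is a placeholder
argument. The fully qualified name of the Fetcher class and the arguments
it expects are not given in this thread, so check the Nutch source and the
nutch shell script for your version.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;

public class MainInvoker {

    /** Hypothetical stand-in for a tool class that has a main method. */
    public static class DemoTool {
        public static String lastArg; // records what main received, for the demo only

        public static void main(String[] args) {
            lastArg = args.length > 0 ? args[0] : null;
        }
    }

    /**
     * Looks up the named class and calls its public static main(String[]),
     * which is roughly what a launcher script does via the java command.
     */
    public static void invokeMain(String className, String... args) throws Exception {
        Class<?> cls = Class.forName(className);
        Method main = cls.getMethod("main", String[].class);
        if (!Modifier.isStatic(main.getModifiers())) {
            throw new IllegalArgumentException(className + ".main is not static");
        }
        // Cast to Object so the array is passed as one argument, not spread as varargs.
        main.invoke(null, (Object) args);
    }

    public static void main(String[] args) throws Exception {
        // Assumption: in real use the class name would be the Fetcher's
        // fully qualified name, taken from your Nutch version's source.
        invokeMain(DemoTool.class.getName(), "segments/demo");
        System.out.println("DemoTool received: " + DemoTool.lastArg);
    }
}
```

If the Nutch jar is on your compile-time classpath, a direct call to the
class's main method is simpler than reflection. The reflective form is
shown only because it runs without Nutch installed.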

