nutch-user mailing list archives

From Markus Jelsma <markus.jel...@openindex.io>
Subject Re: Retrieve the original HTML from nutch-1.4 crawldb
Date Fri, 23 Dec 2011 10:03:46 GMT
Yes, use the SegmentReader tool (bin/nutch readseg) to fetch the data from the
segment's content directory.
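
For illustration, here is a minimal sketch of how a SegmentReader dump might be
driven from a script. It assumes a Nutch 1.x install where the tool is exposed
as "bin/nutch readseg" and that the -no* flags switch off every part of the
segment except the raw content. The helper names, the nutch_bin default and
the paths in the usage comment are hypothetical, not taken from this thread.

```python
import subprocess

# Assumption: these readseg flags suppress everything except the raw fetched
# content, so the dump holds the original HTML rather than the parsed text.
SKIP_FLAGS = ["-nofetch", "-nogenerate", "-noparse", "-noparsedata", "-noparsetext"]


def build_readseg_command(segment_dir, output_dir, nutch_bin="bin/nutch"):
    """Return the argument list for dumping only the content part of a segment.

    nutch_bin is a hypothetical default; point it at your own Nutch launcher.
    """
    return [nutch_bin, "readseg", "-dump", segment_dir, output_dir] + SKIP_FLAGS


def dump_content(segment_dir, output_dir, nutch_bin="bin/nutch"):
    """Run the dump and raise CalledProcessError if Nutch exits non-zero."""
    subprocess.run(build_readseg_command(segment_dir, output_dir, nutch_bin), check=True)


# Usage (hypothetical paths):
#   dump_content("crawl/segments/20111223100346", "content_dump")
```

The command is built separately from the call that runs it, so it can be
printed and checked before anything touches the segment. The dump is plain
text with the record headers mixed in, so the HTML still has to be split out
per URL afterwards.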

On Friday 23 December 2011 08:28:26 Mathijs Homminga wrote:
> I believe they are stored in the /content subdir of a segment.
> If you need a lot of pages, you could also take a look at:
> http://www.commoncrawl.org/
> 
> On Dec 23, 2011, at 3:06 , 邓尧 wrote:
> > Hi,
> > 
> > I need tons of HTML pages for a research project. I followed the
> > tutorial on the wiki page and set up a nutch-1.4 crawler (without solr).
> > I can now dump the extracted text from the segments, but unfortunately
> > the HTML tags are stripped. How can I retrieve the original HTML pages
> > from the crawled database? Or are the original HTML pages not actually
> > stored by nutch?
> > 
> > Thanks
> > 
> > -Yao

-- 
Markus Jelsma - CTO - Openindex
