nutch-dev mailing list archives

From Ankit Dangi <dangian...@gmail.com>
Subject SegmentReader: Why Multiple CrawlDatum section for a record..
Date Tue, 18 Aug 2009 07:10:43 GMT
Hello All,

After performing a crawl with Nutch, I wanted to read the content of all
the crawled URLs. I ran the following command: "$NUTCH_HOME/bin/nutch
readseg -dump $segment myseg", where $segment holds the path to the
segment and 'myseg' is the name of the directory in which the dump of the
segment is created.

I understand that a Nutch segment has 6 sub-directories: crawl_generate,
crawl_fetch, crawl_parse, parse_data, parse_text and content. One of the
records obtained from the dump file is available at this URL for your
reference: http://dangiankit.googlepages.com/rec-13.txt. Could anyone
please look at the file and let me know why it has 8 (eight) CrawlDatum
sections? I expected only 3 such sections, one each for crawl_generate,
crawl_fetch and crawl_parse. For other records, the count varies. Any
other information regarding the CrawlDatum sections would also be
appreciated.
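
To show how the per-record counts can be checked, here is a minimal Python
sketch that tallies the CrawlDatum sections for each record in the dump
text. It assumes that every record carries a line starting with "URL::" and
that each section begins with a line reading exactly "CrawlDatum::"; the
function name count_crawldatum_sections and the path "myseg/dump" are
hypothetical and used only for illustration.

```python
def count_crawldatum_sections(dump_text):
    """Return {url: number of 'CrawlDatum::' header lines} per record.

    Assumes each record has a line starting with 'URL::' and that every
    CrawlDatum section starts with a line that is exactly 'CrawlDatum::'.
    """
    counts = {}
    current_url = None
    for line in dump_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("URL::"):
            # A new record begins; remember its URL as the key.
            current_url = stripped[len("URL::"):].strip()
            counts.setdefault(current_url, 0)
        elif stripped == "CrawlDatum::" and current_url is not None:
            # Section header belonging to the current record.
            counts[current_url] += 1
    return counts


# Hypothetical usage (the dump file location is an assumption):
# with open("myseg/dump", encoding="utf-8", errors="replace") as f:
#     for url, n in count_crawldatum_sections(f.read()).items():
#         print(n, url)
```

Sorting the result by count would make it easy to pick out the records with
the most sections and compare them against the one linked above.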

P.S.
Cross-posted on nutch-dev and nutch-user.
The record file is hosted on my googlepages merely for reference; it is not
intended as spam.

-- 
Ankit Dangi
