nutch-dev mailing list archives

From Ankit Dangi <dangian...@gmail.com>
Subject SegmentReader: How to write content to multiple separate files
Date Mon, 17 Aug 2009 09:35:29 GMT
Hello All,

After performing a crawl using Nutch, I wanted to read the content of all
the crawled URLs. I ran the following command: "$NUTCH_HOME/bin/nutch
readseg -dump $segment myseg", where $segment is the name of the segment
directory and 'myseg' is the name of the directory where the dump of the
segment is created. I noticed that a complete dump of all the crawled
URLs has been placed in a single file named 'dump' within the 'myseg' directory.

I wanted to write the content of each URL into a separate file, i.e. if there
are N URLs, I wanted the content of the N parsed URLs written to N files.
I know I can use 'wget', but I want to achieve this with Nutch.

Finding no way to do this, I looked at the source code. In the Java class
org.apache.nutch.segment.SegmentReader, in the method dump(Path
segment, Path output) at line 222, I replaced
"job.setOutputFormat(TextOutputFormat.class);" with
"job.setOutputFormat(MultipleTextOutputFormat.class);". Ref:
http://hadoop.apache.org/common/docs/r0.19.1/api/org/apache/hadoop/mapred/lib/MultipleOutputFormat.html
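
On its own, that one-line swap is not expected to change anything: the default
generateFileNameForKeyValue(key, value, name) in MultipleOutputFormat simply
returns 'name' (the part-file name), so all records still land in a single
file per reducer. To get one file per URL, a subclass has to override that
method. Below is a minimal sketch, assuming the dump job emits Text keys
holding the URL (worth checking against the SegmentReader source for your
Nutch version); PerUrlTextOutputFormat is a hypothetical name:

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

    // Hypothetical subclass: routes each record to a file named after
    // its URL key instead of the default part file.
    public class PerUrlTextOutputFormat
        extends MultipleTextOutputFormat<Text, Writable> {

      @Override
      protected String generateFileNameForKeyValue(Text key, Writable value,
                                                   String name) {
        // 'name' is the default part-file name (e.g. "part-00000"); the base
        // class returns it unchanged, which is why swapping the output format
        // alone still produces one file. Derive the file name from the URL
        // instead, replacing characters that are unsafe in file names.
        return key.toString().replaceAll("[^A-Za-z0-9._-]", "_");
      }
    }

SegmentReader.dump would then use
"job.setOutputFormat(PerUrlTextOutputFormat.class);". Note that
MultipleOutputFormat keeps one RecordWriter open per distinct file name, so
with many URLs a reducer may hold a large number of files open at once.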

After running the crawl process again, I get the same output as before. Is
there a direct way to write the content of each URL into a separate file?
Otherwise, I shall have to write a program to parse the single 'dump' file
and split its content into separate files, which doesn't seem appropriate.
Does Nutch have a direct way?
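
If post-processing the dump turns out to be the only option, a small splitter
is straightforward. Here is a minimal sketch, assuming each record in the dump
starts with a line beginning "Recno::" (as the SegmentReader dump format does
in the versions I have seen; verify against your own dump file). DumpSplitter
is a hypothetical name:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.PrintWriter;

    // Hypothetical splitter: writes one output file per record
    // in a SegmentReader dump file.
    public class DumpSplitter {
      public static void main(String[] args) throws Exception {
        String dumpFile = args[0];   // e.g. myseg/dump
        String outDir   = args[1];   // directory for the per-record files
        BufferedReader in = new BufferedReader(new FileReader(dumpFile));
        PrintWriter out = null;
        int recno = 0;
        String line;
        while ((line = in.readLine()) != null) {
          if (line.startsWith("Recno::")) {   // a new record begins here
            if (out != null) out.close();
            out = new PrintWriter(outDir + "/record-" + (recno++) + ".txt");
          }
          if (out != null) out.println(line);
        }
        if (out != null) out.close();
        in.close();
      }
    }

Naming the output files record-0.txt, record-1.txt, ... sidesteps
URL-to-filename sanitization; alternatively, the "URL::" line of each record
could be parsed to derive the file name.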

Cross-posted on nutch-user and nutch-dev mailing lists.

-- 
Ankit Dangi
