nutch-user mailing list archives

From reinhard schwab <reinhard.sch...@aon.at>
Subject Re: LinkDB size difference
Date Tue, 01 Sep 2009 09:48:55 GMT
You can dump the linkdb from each run and analyze where they differ.
My guess is that they contain different URLs, because crawl uses
crawl-urlfilter.txt to filter URLs
and fetch uses regex-urlfilter.txt.
So the two runs apply different filters.
I can't explain why it is set up this way; I did not implement it. I have only
run into the difference myself.
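
To see how the two filter files differ on a given installation, something like the following may help. It is only a sketch: the helper names `active_rules` and `filter_diff` are made up here, and it assumes both files sit in the same conf directory and use `#` for comment lines.

```shell
# active_rules FILE
# Prints the rules in a urlfilter file, skipping comment and blank lines.
active_rules() {
  sed -e '/^#/d' -e '/^[[:space:]]*$/d' "$1"
}

# filter_diff CONF_DIR
# Shows the rules that differ between crawl-urlfilter.txt and
# regex-urlfilter.txt in CONF_DIR (assumed location, adjust as needed).
filter_diff() {
  r1=$(mktemp)
  r2=$(mktemp)
  active_rules "$1/crawl-urlfilter.txt" > "$r1"
  active_rules "$1/regex-urlfilter.txt" > "$r2"
  # diff exits with status 1 when the files differ; that is expected here.
  diff "$r1" "$r2" || true
  rm -f "$r1" "$r2"
}

# Example call, assuming the files are in ./conf:
#   filter_diff conf
```

Lines marked `<` come from crawl-urlfilter.txt and lines marked `>` from regex-urlfilter.txt.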

How to dump the linkdb:

reinhard@thord:>bin/nutch readlinkdb
Usage: LinkDbReader <linkdb> {-dump <out_dir> | -url <url>)
        -dump <out_dir> dump whole link db to a text file in <out_dir>
        -url <url>      print information about <url> to System.out
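
One way to act on the dump, as a rough sketch: dump the linkdb from each run into its own directory, then compare the text output. The directory names and the helper `compare_dumps` are made up for illustration, and this assumes the dump writes plain-text `part-*` files into <out_dir>, so check what your version actually produces.

```shell
# Dump both linkdbs first (all paths here are hypothetical):
#   bin/nutch readlinkdb crawl/linkdb       -dump dump-crawl
#   bin/nutch readlinkdb crawl-steps/linkdb -dump dump-steps

# compare_dumps DIR_A DIR_B
# Prints the lines found in only one of the two dumps.
# Lines only in DIR_A start at column 1; lines only in DIR_B are
# indented by one tab.
compare_dumps() {
  a=$(mktemp)
  b=$(mktemp)
  # Assumption: each dump directory holds text files named part-*.
  cat "$1"/part-* | LC_ALL=C sort > "$a"
  cat "$2"/part-* | LC_ALL=C sort > "$b"
  # comm -3 suppresses the lines common to both files.
  LC_ALL=C comm -3 "$a" "$b"
  rm -f "$a" "$b"
}

# Example call:
#   compare_dumps dump-crawl dump-steps
```

Because the lines are sorted independently, an inlink line loses its connection to the URL it belongs to. The output mainly tells you which URLs to look up afterwards with -url <url>.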




Hrishikesh Agashe wrote:
> Hi,
>
> I am observing that the size of the LinkDB differs when I run the same URLs with the
> "crawl" command (intranet crawling) compared to running the individual commands (inject,
> generate, fetch, invertlinks, etc., i.e. Internet crawl).
> Are there any parameters that Nutch passes to invertlinks when running with the "crawl" option?
>
> TIA,
> --Hrishi

