nutch-user mailing list archives

From Dennis Kubes <nutch-...@dragonflymc.com>
Subject Re: Help please trying to crawl local file system
Date Fri, 06 Apr 2007 03:56:18 GMT
Did you set the agent name in the Nutch configuration?  I think the agent 
name still needs to be set even when crawling only the local file system.  
If it is not set, I believe nothing is fetched and errors are thrown, 
but you would only see them if your logging was set up for it.
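
A minimal sketch of the setting in question, assuming it is the 
http.agent.name property and that it goes in nutch-site.xml alongside 
the other overrides. The value shown is a made-up placeholder, not 
something from the original message:

```xml
<!-- nutch-site.xml: place inside the existing <configuration> element -->
<property>
  <name>http.agent.name</name>
  <!-- "my-local-crawler" is a hypothetical placeholder; pick your own non-empty name -->
  <value>my-local-crawler</value>
</property>
```

If nothing is fetched after that, the fetcher's log output is the next 
place to look; in a default install this is probably logs/hadoop.log, 
but that depends on how log4j is configured.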

Dennis Kubes

jim shirreffs wrote:
> I googled and googled and googled. I am trying to crawl my local file 
> system and can't seem to get it right.
> 
> I use this command
> 
> bin/nutch crawl urls -dir crawl
> 
> My urls dir contains one file (files) that looks like this
> 
> file:///c:/joms
> 
> c:/joms exists
> 
> I've modified the config file crawl-urlfilter.txt
> 
> #-^(file|ftp|mailto|sw|swf):
> -^(http|ftp|mailto|sw|swf):
> 
> # skip everything else ..... web spaces
> #-.
> +.*
> 
> 
> And the config file nutch-site.xml adding
> 
> <property>
>  <name>plugin.includes</name>
>  <value>protocol-file|urlfilter-regex|parse-(xml|text|html|js|pdf)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic</value>
> 
> </property>
> <property>
>  <name>file.content.limit</name>
>  <value>-1</value>
> </property>
> </configuration>
> 
> 
> And lastly I've modified regex-urlfilter.txt
> #file systems
> +^file:///c:/top/directory/
> -.
> 
> # skip file: ftp: and mailto: urls
> #-^(file|ftp|mailto):
> -^(http|ftp|mailto):
> 
> # skip image and other suffixes we can't yet parse
> -\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)$
> 
> 
> # skip URLs containing certain characters as probable queries, etc.
> -[?*!@=]
> 
> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> -.*(/.+?)/.*?\1/.*?\1/
> 
> # accept anything else
> +.
> 
> 
> I don't get any errors but nothing gets crawled either. If anyone can 
> point out my mistake(s) I would greatly appreciate it.
> 
> thanks in advance
> 
> jim s
> 
> 
> ps it would also be nice to know this email is getting into the 
> nutch-users mailing list
> 
> 
> 
> 
