nutch-user mailing list archives

From Vineet Garg <vine...@CoWare.com>
Subject Re: Nutch fetching skipped files
Date Fri, 04 Apr 2008 07:17:38 GMT
Hi,

Thanks for the response. Maybe I was not clear in expressing myself.

I am crawling a parent directory under my home on a Linux machine, so my
URLs have to begin with file: and not http:. I have defined the file
protocol and the crawl itself runs fine. My question is: although I have
modified crawl-urlfilter.txt to skip certain file types (extensions such as
.css, .pdf, .xml, .php, and so on), why does the crawl still fetch those
file types and throw errors? How can I avoid this? It is unnecessarily
fetching file types that I have explicitly told it to skip, which is simply
a waste of time.

Our requirement is to crawl and index two different directories residing in
our product installation, which is why both of my URLs begin with file:///.

My second query is:

Before I deploy Nutch to Tomcat, if I run a NutchBean command to test the
crawl, it always returns 0 hits, or a single hit that displays an XML file
name. As mentioned earlier, I have modified crawl-urlfilter.txt to skip
.xml files, yet only an XML file is displayed. Any idea why? After
deployment, when I perform a search, I do get the expected number of hits.
Where could I be going wrong?



Susam Pal wrote:
> Find my reply inline.
>
> On Wed, Apr 2, 2008 at 5:04 PM, Vineet Garg <vineetg@coware.com> wrote:
>   
>> Hi,
>>  I am using Nutch to crawl a local file system. I am crawling with bin/nutch
>> crawl urls -dir crawl -depth 5 -topN 500 >& crawl.log.
>>  But Nutch is fetching files (e.g. .css or .png files) that I have set to be
>> skipped in the crawl-urlfilter.txt file, and throwing errors while parsing:
>>
>>  fetching file:/hm/vineetg/SPD38/libraries/old_spb/.spw_hidden
>>  fetching file:/hm/vineetg/SPD38/share/doc/spwlibcomm/catalog.css
>>  fetching file:/hm/vineetg/SPD38/libraries/example_hiers/.spw_hidden
>>  fetching
>> file:/hm/vineetg/SPD38/share/doc/spwlibcomm/spwlibcommPreface_2.html
>>  fetching file:/hm/vineetg/SPD38/share/doc/spweul/images/
>>  fetching file:/hm/vineetg/SPD38/share/doc/spwfds/wwhdata/
>>  fetching
>> file:/hm/vineetg/SPD38/share/doc/spw2xml/DBMigrator_methodology_5.html
>>  fetching
>> file:/hm/vineetg/SPD38/share/doc/spwtutorial_advanced/spwtutorial_advancedPreface_4.html
>>  fetching file:/hm/vineetg/SPD38/share/doc/spwcsc/title_1.html
>>  fetching file:/hm/vineetg/SPD38/share/doc/spwveis136/images/
>>  Error parsing: file:/hm/vineetg/SPD38/libraries/old_spb/.spw_hidden:
>> failed(2,200): org.apache.nutch.parse.ParseException: parser not found for
>> contentType= url=file:/hm/vineetg/SPD38/libraries/old_spb/.spw_hidden
>>  fetching file:/hm/vineetg/SPD38/share/doc/spwlibwlan/chap1_6.html
>>  Error parsing: file:/hm/vineetg/SPD38/libraries/example_hiers/.spw_hidden:
>> failed(2,200): org.apache.nutch.parse.ParseException: parser not found for
>> contentType= url=file:/hm/vineetg/SPD38/libraries/example_hiers/.spw_hidden
>>  fetching file:/hm/vineetg/SPD38/libraries/cdma_rtl/fir/
>>  Error parsing: file:/hm/vineetg/SPD38/share/doc/spwlibcomm/catalog.css:
>> failed(2,200): org.apache.nutch.parse.ParseException: parser not found for
>> contentType=text/css
>> url=file:/hm/vineetg/SPD38/share/doc/spwlibcomm/catalog.css
>>
>>
>>  My crawl-urlfilter file is:
>>
>>  # The url filter file used by the crawl command.
>>
>>  # Better for intranet crawling.
>>  # Be sure to change MY.DOMAIN.NAME to your domain name.
>>
>>  # Each non-comment, non-blank line contains a regular expression
>>  # prefixed by '+' or '-'.  The first matching pattern in the file
>>  # determines whether a URL is included or ignored.  If no pattern
>>  # matches, the URL is ignored.
>>
>>  # skip http:, ftp:, & mailto: urls
>>  #-^(http|ftp|mailto):
>>  +^(file|ftp|mailto):
>>     
>
> You have allowed URLs beginning with "file:". Since this is the first
> regular expression that matches the URLs being crawled, the rest of
> crawl-urlfilter.txt is ignored. If you read the comments in this file,
> you'll find that it says, "The first matching pattern in the file
> determines whether a URL is included or ignored."
>
> Hope this helps.
>
> Regards,
> Susam Pal
>
>   
>>
>>  # skip image and other suffixes we can't yet parse
>>
>> -\.(css|gif|GIF|jpg|JPG|png|PNG|ico|ICO|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
>>
>>  What could be the reason?
>>
>>  Regards,
>>  Vineet
>>
>>     
>
>   
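The first-match-wins behavior Susam describes can be demonstrated with a
rough Python sketch (this mimics the rule semantics of crawl-urlfilter.txt,
not Nutch's actual RegexURLFilter implementation; the extension list is an
illustrative assumption):

```python
import re

def filter_url(rules, url):
    """First-match-wins filter: each rule is ('+' or '-', regex);
    the first regex that matches the URL decides its fate."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == '+'
    return False  # no pattern matched: URL is ignored

# Illustrative skip pattern, abbreviated from the filter file above
SKIP = r'\.(css|gif|jpg|png|ico|zip|pdf|xml)$'

# Original ordering: the accept rule matches every file: URL first,
# so the skip rule below it is never consulted.
original = [('+', r'^(file|ftp|mailto):'), ('-', SKIP)]

# Reordered: reject unwanted extensions before accepting file: URLs.
fixed = [('-', SKIP), ('+', r'^(file|ftp|mailto):')]

url = 'file:/hm/vineetg/SPD38/share/doc/spwlibcomm/catalog.css'
print(filter_url(original, url))  # True  -> .css is fetched anyway
print(filter_url(fixed, url))     # False -> .css is skipped
```

In other words, moving the `-\.(css|...)$` line above the `+^(file|ftp|mailto):`
line in crawl-urlfilter.txt should stop the unwanted fetches.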

