nutch-agent mailing list archives

From adrian...@interfree.it
Subject crawl-urlfilter.txt
Date Wed, 14 Sep 2005 14:54:28 GMT

Hi,
thank you for your hints, but I didn't give you the following information:

I modified the file crawl-urlfilter.txt as follows:
#start crawl-urlfilter
# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|rtf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# accept anything else
+.
#end crawl-urlfilter
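
A minimal sketch of how rules like these are evaluated, written in Python for illustration only (it is not Nutch's own code). It assumes the documented behaviour of the regex URL filter: rules are tried in file order, the first pattern found anywhere in the URL decides, and a URL that matches no rule is ignored. It also assumes that the rule under the "probable queries" comment is the stock Nutch rule -[?*!@=]. The example.com URLs are hypothetical.

```python
import re

# Rules from the crawl-urlfilter.txt above, in file order.
# ASSUMPTION: the third rule (under the "probable queries" comment) is
# taken to be the stock Nutch rule -[?*!@=].
RULES = [
    ("-", r"^(file|ftp|mailto):"),
    ("-", r"\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|rtf|zip|ppt|mpg"
          r"|xls|gz|rpm|tgz|mov|MOV|exe)$"),
    ("-", r"[?*!@=]"),
    ("+", r"."),
]


def accepts(url):
    """Return True if the first rule whose pattern is found in url is a '+' rule."""
    for sign, pattern in RULES:
        if re.search(pattern, url):
            return sign == "+"
    # No rule matched: the URL is ignored.
    return False


if __name__ == "__main__":
    # Hypothetical URLs, used only to exercise the rules.
    for url in ("http://example.com/welcome.html",
                "http://example.com/app/MyServlet?menu=1"):
        print(url, "accepted" if accepts(url) else "rejected")
```

Under these assumptions, any URL containing "?" or "=" is rejected by the third rule before the final "+." rule is reached.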


I started Nutch with this command line:
bin/nutch crawl urls -dir /home/paul/nutch-searcher.dir -depth 3 >& crawl.log

The file "urls" contains the URL of the following page:

<HTML>

<HEAD>
<TITLE>  TitleOfSite </TITLE>
</HEAD>

<FRAMESET ROWS="14%, *">

<FRAME NORESIZE NAME="MENU" SRC="MyServlet?menu=1" SCROLLING =AUTO">

<FRAME NAME="PAGE"  SRC="../welcome.html" SCROLLING=AUTO">

</FRAMESET>

</HTML>


Nutch crawls and fetches "welcome.html" but does not process "MyServlet?menu=1".
The servlet "MyServlet?menu=1" displays some links, but according to the log
Nutch doesn't fetch any of those links.
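
For reference, here is a small Python sketch of how the two FRAME SRC values resolve to absolute URLs. The base URL is a hypothetical placeholder, since the real address of the frameset page is not given here; only the two SRC values come from the page above.

```python
from urllib.parse import urljoin, urlsplit

# HYPOTHETICAL base URL; the real address of the frameset page is not given.
BASE = "http://example.com/app/index.html"

# SRC values copied from the two FRAME tags above.
FRAME_SRCS = ["MyServlet?menu=1", "../welcome.html"]


def resolve_frames(base, srcs):
    """Return the absolute URL for each frame SRC, resolved against base."""
    return [urljoin(base, src) for src in srcs]


if __name__ == "__main__":
    for url in resolve_frames(BASE, FRAME_SRCS):
        query = urlsplit(url).query
        print(url, "(query: %s)" % query if query else "(no query)")
```

The servlet URL keeps its query string ("menu=1") after resolution, while the welcome page has none, so the two frames differ in whether the URL contains "?" and "=".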
I hope the question is clear and am looking forward to receiving your answer.

                                         Adriano
