nutch-agent mailing list archives

From Bernd Eckenfels <be-mail2...@lina.inka.de>
Subject nutch gets forms?
Date Thu, 22 Sep 2005 04:37:51 GMT
Hello,

Is there a reason why Nutch-based crawlers follow POST forms while
traversing links?


turingc.cs.washington.edu - - [14/Sep/2005:22:09:57 +0200] "GET
/lina/cgi-bin/freefire-mail.cgi HTTP/1.0" 302 230 "-" "NutchCVS/0.8-dev
(Nutch; http://lucene.apache.org/nutch/bot.html;
nutch-agent@lucene.apache.org)"
turingc.cs.washington.edu - - [14/Sep/2005:22:10:04 +0200] "GET
/lina/freefire-l/index.en.html HTTP/1.0" 302 204 "-" "NutchCVS/0.8-dev
(Nutch; http://lucene.apache.org/nutch/bot.html;
nutch-agent@lucene.apache.org)"

The log entries above are from www.freefire.org, which contains a form:

<form method="POST" ACTION="http://sites.inka.de/lina/cgi-bin/freefire-mail.cgi">

which leads to the GET request for /lina/cgi-bin/freefire-mail.cgi shown
in the first log entry above.

So is the crawler issuing a GET for the form's POST action URL instead of
ignoring the form?
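
To make the expected behaviour concrete, here is a minimal sketch of link extraction that does not treat the action URL of a POST form as an outlink. This is not Nutch's code: the names OutlinkParser and extract_outlinks are hypothetical, and letting GET forms contribute their action URL is only one possible design choice.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class OutlinkParser(HTMLParser):
    """Hypothetical outlink collector that skips POST form actions."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.outlinks = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so ACTION
        # arrives here as "action".
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.outlinks.append(urljoin(self.base_url, attrs["href"]))
        elif tag == "form" and attrs.get("action"):
            # A form without a method attribute defaults to GET.
            method = (attrs.get("method") or "get").lower()
            if method == "get":
                self.outlinks.append(urljoin(self.base_url, attrs["action"]))
            # POST forms are ignored: their action URL is not a link.


def extract_outlinks(html, base_url):
    """Return the absolute URLs a crawler could safely fetch with GET."""
    parser = OutlinkParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.outlinks
```

Fed a page containing the form quoted above, such an extractor would return no entry for freefire-mail.cgi, so the crawler would never request it.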

Regards
Bernd
-- 
  (OO)     -- Bernd_Eckenfels@Mörscher_Strasse_8.76185Karlsruhe.de --
 ( .. )    ecki@{inka.de,linux.de,debian.org}  http://www.eckes.org/
  o--o   1024D/E383CD7E  eckes@IRCNet  v:+497211603874  f:+49721151516129
(O____O)  When cryptography is outlawed, bayl bhgynjf jvyy unir cevinpl!
