nutch-user mailing list archives

From Paul Tomblin <ptomb...@xcski.com>
Subject Isn't this a bug?
Date Tue, 01 Sep 2009 15:08:03 GMT
If I crawl a page with a URL like
http://localhost/Documents/pharma/DocSamples/?C=N;O=A
(which is what you get when a directory has no index.*, you've
configured "Options Indexes", and you click one of the sorting
options), and the page lists the files in the directory as relative
links like "foo.html", Nutch tries to fetch each file with the second
part of that query string appended, like "foo.html;O=A", which gets
a 404.
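
For comparison, here is a minimal sketch of what standard relative-reference
resolution (RFC 3986) gives for this case. It uses Python's
urllib.parse.urljoin, not anything from Nutch, and "foo.html" is just the
placeholder file name from above. The helper name resolve is made up for
illustration.

```python
from urllib.parse import urljoin

# Base URL of the directory listing, including the sort query string.
BASE = "http://localhost/Documents/pharma/DocSamples/?C=N;O=A"


def resolve(base: str, href: str) -> str:
    """Resolve a relative href against base (hypothetical helper).

    Under standard resolution the base's query string ("C=N;O=A")
    is not carried over to the resolved link.
    """
    return urljoin(base, href)


if __name__ == "__main__":
    for href in ("foo.html", "18whistle.html"):
        print(resolve(BASE, href))
```

Neither resolved link should end in ";O=A", since a relative link with no
query of its own should not inherit any part of the base URL's query.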

Look at the parse data for http://localhost/Documents/pharma/DocSamples/?C=D;O=A
...
     [java]   outlink: toUrl: http://localhost/Documents/pharma/DocSamples/15%20minutes.htm;O=A anchor: 15 minutes.htm
     [java]   outlink: toUrl: http://localhost/Documents/pharma/DocSamples/18whistle.html;O=A anchor: 18whistle.html
     [java]   outlink: toUrl: http://localhost/Documents/pharma/DocSamples/2010%20brings%20changes.doc;O=A anchor: 2010 brings changes.doc
...

-- 
http://www.linkedin.com/in/paultomblin
