nutch-user mailing list archives

From Mark Achee <mark.ac...@usm.edu>
Subject Re: Getting original URL for redirect
Date Wed, 04 May 2011 20:54:56 GMT
This is backwards from what you want, but it may help.  Using the original URL:

bin/nutch readdb output/crawldb -url 'http://example.org/original/url/'

Replace "output" with the name of your crawl output directory.  If the URL
was redirected, the "Metadata" line will say "moved" and show the new
location.  If there were multiple redirects, you'll have to repeat the
lookup for each new URL in turn.
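
The repeated lookup could be scripted along these lines.  This is only a
rough sketch: it assumes the "Metadata" line ends with the target URL
(something like "moved(12), lastModified=0: http://..."), which may differ
between Nutch versions, so check your own readdb output first.  The
CRAWLDB and MAX_HOPS variables and the function names are made up for
illustration.

```shell
#!/bin/sh
# Sketch: follow a chain of redirects recorded in a Nutch crawldb.
# ASSUMPTION: the "Metadata" line printed by `readdb -url` contains the word
# "moved" and ends with the redirect target as its last field, e.g.
#   Metadata: _pst_: moved(12), lastModified=0: http://example.org/new/url/
# Adjust the awk pattern if your Nutch version prints something different.

CRAWLDB="${CRAWLDB:-output/crawldb}"  # replace "output" with your crawl directory
MAX_HOPS="${MAX_HOPS:-10}"            # hypothetical guard against redirect loops

# Look up a single URL in the crawldb (same command as above).
lookup_url() {
    bin/nutch readdb "$CRAWLDB" -url "$1"
}

# Print the redirect target of a URL, or nothing if it was not redirected.
redirect_target() {
    lookup_url "$1" | awk '/^Metadata:/ && /moved/ { print $NF; exit }'
}

# Print every URL in the chain, one per line, starting with the original.
follow_redirects() {
    url="$1"
    hops=0
    while [ -n "$url" ] && [ "$hops" -le "$MAX_HOPS" ]; do
        echo "$url"
        url="$(redirect_target "$url")"
        hops=$((hops + 1))
    done
}
```

After sourcing this, follow_redirects 'http://example.org/original/url/'
should list the original URL followed by each redirect target found in the
crawldb, provided the output format matches the assumption above.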

-Mark


On Thu, Apr 21, 2011 at 5:23 PM, Chris Woolum <cwoolum@moonvalley.com> wrote:

> Hey Everyone,
>
>
> I am doing some crawling in which I need to match my crawl data back up
> to my original url set but the problem is that in the case of a
> redirect, only the new URL is saved. Is there any way to get the
> original URL that started the crawl of the redirect?
>
> Thanks, Chris
>
