nutch-user mailing list archives

From "Fuad Efendi" <>
Subject RE: URL with Space
Date Fri, 04 Sep 2009 15:25:21 GMT
> From: Fuad Efendi 
> I already posted here that URL Normalizer is called after extracting
> Outlinks from a Page.

-I was _wrong_, sorry.

Code from Injector:
      try {
        url = urlNormalizers.normalize(url, URLNormalizers.SCOPE_INJECT);
        url = filters.filter(url);             // filter the url
      } catch (Exception e) {

You have to ensure that Nutch uses the proper config file (with correct

In Java source, a pattern given to Perl5Compiler should use the escaped form
\\s instead of \s; I am not sure whether a literal whitespace character can be
used inside an XML node.
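
To illustrate the escaping point, here is a minimal sketch. It assumes the
standard java.util.regex package rather than Perl5Compiler, and the class name
SpaceNormalizerSketch and the helper encodeSpaces are hypothetical, not part of
Nutch. It only shows that \s has to be typed as "\\s" in a Java string literal:

```java
import java.util.regex.Pattern;

public class SpaceNormalizerSketch {
    // In Java source the regex \s is written "\\s": the Java compiler
    // consumes one backslash and the regex engine then receives \s.
    private static final Pattern WHITESPACE = Pattern.compile("\\s");

    // Hypothetical helper (not a Nutch API): replace every whitespace
    // character in the URL with the percent-encoded form %20.
    static String encodeSpaces(String url) {
        return WHITESPACE.matcher(url).replaceAll("%20");
    }

    public static void main(String[] args) {
        // example.com is a placeholder URL used only for illustration.
        System.out.println(encodeSpaces("http://example.com/my page.html"));
    }
}
```

XML itself does not treat the backslash as an escape character, so a pattern
read from an XML text node would presumably be written with a single
backslash (\s) there; the doubling is needed only in Java string literals.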

Some "normalizers" in Nutch are synchronized singletons, so you will have an
obvious performance bottleneck.
