nutch-user mailing list archives

From "Fuad Efendi" <f...@efendi.ca>
Subject RE: URL with Space
Date Fri, 04 Sep 2009 15:25:21 GMT
> From: Fuad Efendi 
> I already posted here that URL Normalizer is called after extracting
> Outlinks from a Page.

I was _wrong_, sorry.


Code from Injector:
      try {
        url = urlNormalizers.normalize(url, URLNormalizers.SCOPE_INJECT); // normalize first
        url = filters.filter(url);             // then filter the url
      } catch (Exception e) {
        // ... (handler body not quoted in this message)
      }



You have to ensure that Nutch uses the proper config file (the one with the
correct normalizer).


With Perl5Compiler in Java, the pattern should use the escaped \\s instead
of \s; I am not sure whether a literal whitespace character can be used
inside an XML node.
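
To illustrate the escaping point, here is a minimal sketch. It assumes the
standard java.util.regex package rather than Perl5Compiler, and the class name,
helper method, and example URL are made up for illustration; none of them come
from Nutch. It only shows that \s has to be written as "\\s" inside a Java
string literal.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of Nutch.
class SpaceNormalizerSketch {
    // In Java source the regex \s is written "\\s": the compiler consumes
    // one backslash and the regex engine then sees \s.
    private static final Pattern WHITESPACE = Pattern.compile("\\s");

    // Hypothetical helper: replaces every whitespace character with %20.
    // (A real normalizer might want to treat tabs and newlines differently.)
    static String encodeSpaces(String url) {
        return WHITESPACE.matcher(url).replaceAll("%20");
    }

    public static void main(String[] args) {
        // prints http://example.com/my%20file.html
        System.out.println(encodeSpaces("http://example.com/my file.html"));
    }
}
```

Writing "\s" with a single backslash in the string literal would not produce
the same regex, because the Java compiler interprets the backslash before the
regex engine ever sees it.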


P.S.
Some "normalizers" in Nutch are synchronized singletons, so you will have an
obvious performance bottleneck.
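
A rough sketch of why a synchronized singleton becomes a bottleneck, and one
possible way around it. Both classes below are hypothetical and are not taken
from Nutch; they assume java.util.regex, whose Pattern objects are immutable
and can be shared between threads.

```java
import java.util.regex.Pattern;

// Hypothetical: a singleton whose normalize() is synchronized.
final class SynchronizedNormalizer {
    private static final SynchronizedNormalizer INSTANCE = new SynchronizedNormalizer();
    private final Pattern whitespace = Pattern.compile("\\s");

    private SynchronizedNormalizer() {}

    static SynchronizedNormalizer getInstance() { return INSTANCE; }

    // Every caller takes the same lock, so threads normalize one at a time.
    synchronized String normalize(String url) {
        return whitespace.matcher(url).replaceAll("%20");
    }
}

// Hypothetical alternative: no lock, only a shared immutable Pattern.
final class LockFreeNormalizer {
    private static final Pattern WHITESPACE = Pattern.compile("\\s");

    // Each call creates its own Matcher, so no synchronization is needed.
    static String normalize(String url) {
        return WHITESPACE.matcher(url).replaceAll("%20");
    }
}

class NormalizerDemo {
    public static void main(String[] args) {
        String url = "http://example.com/a b";
        System.out.println(SynchronizedNormalizer.getInstance().normalize(url));
        System.out.println(LockFreeNormalizer.normalize(url));
    }
}
```

Both variants return the same result; the difference is only that the first
one serializes all calling threads on a single lock.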



