nutch-user mailing list archives

From MilleBii <mille...@gmail.com>
Subject Re: Help me, No urls to fetch.
Date Mon, 07 Sep 2009 07:17:55 GMT
Presumably you've already checked the rules in crawl-urlfilter.txt.
Beware of one nasty thing that can happen: make sure each rule is followed
directly by a line break (CR/LF). I recently had a problem because some
"invisible" spaces were following one rule, and therefore that rule was
never matching. It took me a while to figure out.
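
To make the trailing-space pitfall concrete, here is a minimal Python sketch. It is not Nutch code: it assumes a simplified model of the regex URL filter in which each rule line starts with '+' (accept) or '-' (reject), the rest of the line is used verbatim as the regular expression, and the first matching rule decides. The function names, the sample rule and the sample URL are made up for illustration.

```python
import re


def find_suspect_rules(rules_text):
    """Return (line_number, line) pairs for rules that end in spaces or tabs."""
    suspects = []
    for number, line in enumerate(rules_text.splitlines(), start=1):
        # Only lines starting with '+' or '-' are treated as rules here;
        # comments and blank lines are ignored.
        if line.startswith(("+", "-")) and line != line.rstrip(" \t"):
            suspects.append((number, line))
    return suspects


def accepts(rules_text, url):
    """Simplified first-match-wins model (an assumption, not Nutch's code).

    The text after the leading '+' or '-' is used as the regex exactly as
    written, so a trailing space becomes part of the pattern.
    """
    for line in rules_text.splitlines():
        if not line.startswith(("+", "-")):
            continue
        if re.search(line[1:], url):
            return line[0] == "+"
    return False  # no rule matched, so the URL is rejected


if __name__ == "__main__":
    clean = "+^http://([a-z0-9]*\\.)*apache\\.org/\n"
    dirty = "+^http://([a-z0-9]*\\.)*apache\\.org/ \n"  # note the trailing space
    url = "http://lucene.apache.org/"

    print(accepts(clean, url))        # True
    print(accepts(dirty, url))        # False
    print(find_suspect_rules(dirty))  # reports line 1
```

Running find_suspect_rules over the contents of the filter file (read with open(...).read()) would list any rule lines worth retyping by hand.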


2009/9/7 zo tiger <zo.tiger@hotmail.com>

>
> This is my hadoop.log file's contents
>
>
> 2009-09-07 03:32:58,137 INFO  plugin.PluginRepository -         HTTP Framework (lib-http)
> 2009-09-07 03:32:58,137 INFO  plugin.PluginRepository -         Text Parse Plug-in (parse-text)
> 2009-09-07 03:32:58,137 INFO  plugin.PluginRepository -         Pass-through URL Normalizer (urlnormalizer-pass)
> 2009-09-07 03:32:58,137 INFO  plugin.PluginRepository -         Regex URL Filter (urlfilter-regex)
> 2009-09-07 03:32:58,137 INFO  plugin.PluginRepository -         Http Protocol Plug-in (protocol-http)
> 2009-09-07 03:32:58,137 INFO  plugin.PluginRepository -         XML Response Writer Plug-in (response-xml)
> 2009-09-07 03:32:58,137 INFO  plugin.PluginRepository -         Regex URL Normalizer (urlnormalizer-regex)
> 2009-09-07 03:32:58,137 INFO  plugin.PluginRepository -         OPIC Scoring Plug-in (scoring-opic)
> 2009-09-07 03:32:58,137 INFO  plugin.PluginRepository -         CyberNeko HTML Parser (lib-nekohtml)
> 2009-09-07 03:32:58,137 INFO  plugin.PluginRepository -         Anchor Indexing Filter (index-anchor)
> 2009-09-07 03:32:58,137 INFO  plugin.PluginRepository -         JavaScript Parser (parse-js)
> 2009-09-07 03:32:58,137 INFO  plugin.PluginRepository -         URL Query Filter (query-url)
> 2009-09-07 03:32:58,137 INFO  plugin.PluginRepository -         Regex URL Filter Framework (lib-regex-filter)
> 2009-09-07 03:32:58,137 INFO  plugin.PluginRepository -         JSON Response Writer Plug-in (response-json)
> 2009-09-07 03:32:58,137 INFO  plugin.PluginRepository - Registered Extension-Points:
> 2009-09-07 03:32:58,137 INFO  plugin.PluginRepository -         Nutch Summarizer (org.apache.nutch.searcher.Summarizer)
> 2009-09-07 03:32:58,137 INFO  plugin.PluginRepository -         Nutch Protocol (org.apache.nutch.protocol.Protocol)
> 2009-09-07 03:32:58,137 INFO  plugin.PluginRepository -         Nutch Analysis (org.apache.nutch.analysis.NutchAnalyzer)
> 2009-09-07 03:32:58,137 INFO  plugin.PluginRepository -         Nutch Field Filter (org.apache.nutch.indexer.field.FieldFilter)
> 2009-09-07 03:32:58,138 INFO  plugin.PluginRepository -         HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
> 2009-09-07 03:32:58,138 INFO  plugin.PluginRepository -         Nutch Query Filter (org.apache.nutch.searcher.QueryFilter)
> 2009-09-07 03:32:58,138 INFO  plugin.PluginRepository -         Nutch Search Results Response Writer (org.apache.nutch.searcher.response.ResponseWriter)
>
>
> MilleBii wrote:
> >
> > Is there more information in logs/hadoop file ?
> >
> > What is your plug-in list ?
> >
> > 2009/9/2 zo tiger <zo.tiger@hotmail.com>
> >
> >>
> >> Thank you for your reply.
> >>
> >> In the urls directory (exactly /nutch/search/urls), there is a file
> >> urllist.txt.
> >>
> >> Its content is as follows.
> >>
> >>      http://lucene.apache.org
> >>
> >> I don't understand why nutch can not fetch any url.
> >>
> >>
> >> Paul Tomblin wrote:
> >> >
> >> > On Wed, Sep 2, 2009 at 6:36 AM, zo tiger<zo.tiger@hotmail.com> wrote:
> >> >>
> >> >
> >> >> At last i ran bin/nutch crawl command but it gives
> >> >>
> >> >> No urls to fetch check your filter and seed list error
> >> >>
> >> >> I am sure there is no problem in the crawl-url filter or the other
> >> >> configuration XML files.
> >> >>
> >> >> Does anyone know of a possible cause?
> >> >>
> >> >
> >> > What's in your url directory?
> >> >
> >> >
> >> > --
> >> > http://www.linkedin.com/in/paultomblin
> >> >
> >> >
> >>
> >> --
> >> View this message in context:
> >> http://www.nabble.com/Help-me%2C-No-urls-to-fetch.-tp25255142p25255944.html
> >> Sent from the Nutch - User mailing list archive at Nabble.com.
> >>
> >>
> >
> >
> > --
> > -MilleBii-
> >
> >
>
> --
> View this message in context:
> http://www.nabble.com/Help-me%2C-No-urls-to-fetch.-tp25255142p25324884.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>
>


-- 
-MilleBii-
