nutch-user mailing list archives

From Anurag <anurag.it.jo...@gmail.com>
Subject Re: Help: Crawl returns no URLs
Date Mon, 07 Mar 2011 12:07:52 GMT
How do you know that you are not using a URL filter file? "0 records
selected for fetching" means that either no URL passed the filters or the
seed URLs themselves are wrong. Check every file that mentions a domain name
for errors, for example:

/nutch-1.0/conf/regex-urlfilter.txt
nutch-1.0/conf/prefix-urlfilter.txt
nutch-1.0/conf/crawl-urlfilter.txt

Try checking these.
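
To see how a filter file can leave the generator with nothing to fetch, here
is a minimal Python sketch of first-match rule evaluation in the style of
regex-urlfilter.txt. It is an illustration, not Nutch's own code: the function
names load_rules and accepts, the sample rules, and the seed URLs are all
hypothetical. It assumes that the first matching + or - rule decides and that
a URL matching no rule is dropped. Python's re module also differs from Java's
regex engine in some details, so treat the result as a rough check only.

```python
import re

# Hypothetical sample rules, written in the "+regex / -regex" style of a
# Nutch URL filter file. MY.DOMAIN.NAME is a placeholder that was never edited.
SAMPLE_RULES = r"""
# skip file:, ftp:, and mailto: urls
-^(file|ftp|mailto):
# accept only hosts under the placeholder domain
+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
"""


def load_rules(text):
    """Parse filter text into a list of (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in ("+", "-"):
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Return the verdict of the first matching rule; no match means reject."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False  # assumption: a URL that matches no rule is dropped


if __name__ == "__main__":
    rules = load_rules(SAMPLE_RULES)
    # Hypothetical seed URLs, standing in for the contents of the urls directory.
    seeds = ["http://www.example.com/", "http://www.example.org/index.html"]
    for url in seeds:
        verdict = "accepted" if accepts(rules, url) else "rejected"
        print(verdict, url)
```

If every seed comes out as rejected under your real rules, the generator
would have nothing to select, which would be consistent with the
"0 records selected for fetching" warning. Note also that the plugin list in
the quoted message includes urlfilter-regex.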
On Mon, Mar 7, 2011 at 8:49 AM, chidu r [via Lucene] <
ml-node+2644587-295056780-146354@n3.nabble.com> wrote:

> Hi all
>
> I am trying to set up Nutch 1.2 on Hadoop and used the instructions at
> http://wiki.apache.org/nutch/NutchHadoopTutorial, which have been very useful.
>
>
> However, I find that when I execute the command:
>
> $bin/nutch crawl urls -dir crawl -depth 4 -topN 50
>
> The crawler stops at the generator stage with the message:
> 2011-03-06 17:23:49,538 WARN  crawl.Generator - Generator: 0 records
> selected for fetching, exiting ...
>
> I have configured the following plugins in nutch-site.xml
>  protocol-http|parse-(text|html|js)|urlnormalizer-(pass|regex|basic)|urlfilter-regex|index-(basic|anchor)
>
>
> I am not using crawl-urlfilter.txt or regex-urlfilter.txt to filter URLs. I
> initiated the crawl with 10 seed URLs from popular sites on the internet.
>
> Any pointers to what I am missing here?
>
>
> regards
> Chidu
>
>



-- 
Kumar Anurag


