nutch-user mailing list archives

From cervenkovab <cervenko...@gmail.com>
Subject Re: Nutch 2.1 different batch id (null)
Date Sun, 28 Apr 2013 15:33:57 GMT
Hello,
I have the same problem: *"Skipping some.relevant.page.com; different
batch id (null)"* is logged for a lot of pages. My configuration is almost the
same as the one below (only the OS is different and the storage is HBase).

I run the steps (inject), generate, fetch, and the skipping appears in the parse
phase. But I want those pages to be parsed; the URLs are relevant for me.
The problem is that I want to crawl a lot of websites. *When a lot of
pages are skipped, I end up with very few collected pages and many empty pages,
which is bad for me*. I also don't know why, for example, the page
http://videos.arte.tv/de/videos/arte-reportage--7471210.html is fetched
and parsed, while the page
http://videos.arte.tv/de/videos/real-humans-echte-menschen-7-10-achtung-schockierende-bilder--7455402.html
is skipped, as are most of the other pages of the domain arte.tv.
They are all on the same domain.

*What causes this error? How can I resolve this problem?*
Thanks for your help.

--
View this message in context: http://lucene.472066.n3.nabble.com/Nutch-2-1-different-batch-id-null-tp4040592p4059636.html
Sent from the Nutch - User mailing list archive at Nabble.com.
