nutch-user mailing list archives

From "Niclas Rothman" <n...@lechill.com>
Subject NewbieNutcher.....
Date Tue, 04 Oct 2005 09:24:20 GMT
Hi all Nutch users!!!

I'm completely new to the Nutch crawl system and have been trying for some
time to crawl a site successfully, without much luck.

I have written a shell script to do the work (all from the tutorial).
However, my round trips through the generate, fetch and update commands
seem to fail; or rather, when I run the second and third rounds, no new
URLs are found to fetch, and my site only gets partly indexed.
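
Between the rounds I suppose the WebDB could be inspected to see whether
any new, unfetched pages actually got added. A rough sketch of what I mean
(assuming the Nutch 0.7 WebDB reader tool; the exact flags are my guess):

./nutch readdb db -stats        # print page/link counts for the WebDB
./nutch readdb db -dumppageurl  # dump the page records known to the db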

My script looks like this:

************************************************************************

#Remove any directories since last test.
rm -r db
rm -r segments

#Create directories
mkdir db
mkdir segments

./nutch admin db -create

# root-urls.txt contains just one URL: http://www.rivieradvd.com/home.htm
./nutch inject db -urlfile ../conf/root-urls.txt

./nutch generate db segments
s1=`ls -d segments/2* | tail -1`
./nutch fetch $s1
./nutch updatedb db $s1

for (( i = 0; i <= 5; i++ ))
do
    ./nutch generate db segments -topN 1000
    s1=`ls -d segments/2* | tail -1`
    ./nutch fetch $s1
    ./nutch updatedb db $s1
done

./nutch dedup segments dedup.tmp

************************************************************************
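
One thing I notice while pasting this: the script runs dedup but never the
per-segment index step. If I read the 0.7 tutorial correctly, the indexing
part would look roughly like this (only a sketch; the merge command and its
arguments are my assumption):

for segment in segments/2*
do
    ./nutch index $segment            # build an index for each fetched segment
done
./nutch dedup segments dedup.tmp      # then delete duplicate documents
./nutch merge index segments/*        # merge per-segment indexes (command name assumed)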

 

My crawl-urlfilter file looks like this (it should also fetch pages with
query strings, right?):

 

************************************************************************

# The url filter file used by the crawl command.

# Better for intranet crawling.
# Be sure to change MY.DOMAIN.NAME to your domain name.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'.  The first matching pattern in the file
# determines whether a URL is included or ignored.  If no pattern
# matches, the URL is ignored.

# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG|js)$

# skip URLs containing certain characters as probable queries, etc.
#-[*!@=]

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*rivieradvd.com/

# skip everything else
-.

************************************************************************
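
On my query string question: since the "probable queries" line is commented
out above, URLs containing '?' should not be skipped by this file. What I am
less sure about: the first comment says this file is "used by the crawl
command", while my script runs the individual commands, so perhaps a
different filter file (regex-urlfilter.txt?) is the one actually applied. A
hypothetical way to see which filter file the tools read (the property name
to look for is my guess):

grep -n "urlfilter" ../conf/nutch-default.xml ../conf/nutch-site.xml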

I hope I have given you enough information to put me on the right track.

Best regards,
Niclas
