nutch-user mailing list archives

From Matthew Holt <mh...@redhat.com>
Subject Re: Recrawling question
Date Tue, 06 Jun 2006 20:58:59 GMT
Sorry to be asking so many questions. Below is the current script I'm 
using. It already indexes the segments, so would I run invertlinks 
directly after the fetch? I'm a bit confused. Thanks.
matt

-------------------------------------------------------
#!/bin/bash

# A simple script to run a Nutch re-crawl

if [ -n "$1" ]
then
  crawl_dir=$1
else
  echo "Usage: recrawl crawl_dir [depth] [adddays]"
  exit 1
fi

if [ -n "$2" ]
then
  depth=$2
else
  depth=5
fi

if [ -n "$3" ]
then
  adddays=$3
else
  adddays=0
fi

webdb_dir=$crawl_dir/db
segments_dir=$crawl_dir/segments
index_dir=$crawl_dir/index

# The generate/fetch/update cycle
for ((i=1; i <= depth ; i++))
do
  bin/nutch generate $webdb_dir $segments_dir -adddays $adddays
  segment=`ls -d $segments_dir/* | tail -1`
  bin/nutch fetch $segment
  bin/nutch updatedb $webdb_dir $segment
done

# Update segments
mkdir tmp
bin/nutch updatesegs $webdb_dir $segments_dir tmp
rm -R tmp

# Index segments
for segment in `ls -d $segments_dir/* | tail -$depth`
do
  bin/nutch index $segment
done

# De-duplicate indexes
# "bogus" argument is ignored but needed due to
# a bug in the number of args expected
bin/nutch dedup $segments_dir bogus

# Merge indexes
ls -d $segments_dir/* | xargs bin/nutch merge $index_dir

---------------------------------------------------------------
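
In case it helps frame the question, here's my rough understanding of 
where invertlinks would sit, based on the 0.8-style docs. This is only a 
sketch and untested; the crawldb/linkdb layout is an assumption (my 
install still uses the db/ layout above):

-------------------------------------------------------
#!/bin/bash

# Sketch only (untested): 0.8-style recrawl where invertlinks runs
# after all fetching and before indexing. Assumes crawl_dir contains
# crawldb, linkdb and segments directories (0.8 layout).

crawl_dir=$1
depth=${2:-5}

# The generate/fetch/update cycle
for ((i=1; i <= depth ; i++))
do
  bin/nutch generate $crawl_dir/crawldb $crawl_dir/segments
  segment=`ls -d $crawl_dir/segments/* | tail -1`
  bin/nutch fetch $segment
  bin/nutch updatedb $crawl_dir/crawldb $segment
done

# Rebuild the link database from all segments, then index them
bin/nutch invertlinks $crawl_dir/linkdb -dir $crawl_dir/segments
bin/nutch index $crawl_dir/indexes $crawl_dir/crawldb \
  $crawl_dir/linkdb $crawl_dir/segments/*
-------------------------------------------------------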

Stefan Neufeind wrote:

>You're missing the actual indexing of the pages :-) That step happens
>inside the "crawl" command, which does everything in one go. After
>you've fetched everything, use:
>
>nutch invertlinks ...
>nutch index ...
>
>Hope that helps. Otherwise let me know and I'll dig out the complete
>command lines for you.
>
>
>Regards,
> Stefan
>
>Matthew Holt wrote:
>
>>Just FYI: after I do the recrawl I do stop and restart Tomcat, and the
>>newly created page still cannot be found.
>>
>>Matthew Holt wrote:
>>
>>>The recrawl worked this time, and I recrawled the entire db using the
>>>-adddays argument (in my case ./recrawl crawl 10 31). However, it
>>>didn't find a newly created page.
>>>
>>>If I delete the database and do the initial crawl over again, the new
>>>page is found. Any idea what I'm doing wrong or why it isn't finding it?
>>>
>>>Thanks!
>>>Matt
>>>
>>>Matthew Holt wrote:
>>>
>>>>Stefan,
>>>> Thanks a bunch! I see what you mean.
>>>>matt
>>>>
>>>>Stefan Neufeind wrote:
>>>>
>>>>>Matthew Holt wrote:
>>>>>
>>>>>>Hi all,
>>>>>> I have already successfully indexed all the files on my domain only
>>>>>>(as specified in the conf/crawl-urlfilter.txt file).
>>>>>>
>>>>>>Now when I use the below script (./recrawl crawl 10 31) to recrawl the
>>>>>>domain, it begins indexing pages outside my domain (such as Wikipedia).
>>>>>>How do I prevent this? Thanks!
>>>>>
>>>>>Hi Matt,
>>>>>
>>>>>have a look at regex-urlfilter. "crawl" is special in some ways:
>>>>>it's actually a shortcut for several steps, and it has its own
>>>>>urlfilter file. But if you run the steps individually, that
>>>>>urlfilter file is no longer used.
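
If I'm reading Stefan's regex-urlfilter point correctly, the individual 
commands read conf/regex-urlfilter.txt rather than conf/crawl-urlfilter.txt 
(which only the all-in-one crawl command uses), so the domain restriction 
has to be copied there as well. Something like the following, where 
example.com is just a placeholder for the real domain:

---------------------------------------------------------------
# conf/regex-urlfilter.txt (illustrative entries only; substitute
# your own domain for the example.com placeholder)

# skip file:, ftp: and mailto: urls
-^(file|ftp|mailto):

# accept anything on example.com and its subdomains
+^http://([a-z0-9]*\.)*example.com/

# reject everything else
-.
---------------------------------------------------------------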
