Sorry to be asking so many questions. Below is the current script I'm
using. It's indexing the segments, so do I use invertlinks directly
after the fetch? I'm kind of confused. Thanks.
matt
-------------------------------------------------------
#!/bin/bash
# A simple script to run a Nutch re-crawl
if [ -n "$1" ]
then
  crawl_dir=$1
else
  echo "Usage: recrawl crawl_dir [depth] [adddays]"
  exit 1
fi
if [ -n "$2" ]
then
  depth=$2
else
  depth=5
fi
if [ -n "$3" ]
then
  adddays=$3
else
  adddays=0
fi
webdb_dir=$crawl_dir/db
segments_dir=$crawl_dir/segments
index_dir=$crawl_dir/index
# The generate/fetch/update cycle
for ((i=1; i <= depth ; i++))
do
  bin/nutch generate $webdb_dir $segments_dir -adddays $adddays
  segment=`ls -d $segments_dir/* | tail -1`
  bin/nutch fetch $segment
  bin/nutch updatedb $webdb_dir $segment
done
# Update segments
mkdir tmp
bin/nutch updatesegs $webdb_dir $segments_dir tmp
rm -R tmp
# Index segments
for segment in `ls -d $segments_dir/* | tail -$depth`
do
  bin/nutch index $segment
done
# De-duplicate indexes
# "bogus" argument is ignored but needed due to
# a bug in the number of args expected
bin/nutch dedup $segments_dir bogus
# Merge indexes
ls -d $segments_dir/* | xargs bin/nutch merge $index_dir
---------------------------------------------------------------
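If I'm reading Stefan's reply (quoted below) right, the part after the
fetch loop might look something like the sketch below. It is untested and
rests on assumptions: a Nutch 0.8-style layout with crawldb, linkdb,
segments and indexes under the crawl directory, and the argument order
shown. The invertlinks arguments in particular seem to differ between
releases (some take the segments directory directly, without -dir), so
running each command without arguments to see its usage is the safer
check. The NUTCH variable and the post_fetch name are made up for this
example.

```shell
#!/bin/bash
# Sketch only: the steps that would follow the generate/fetch/update loop.
# ASSUMPTION: Nutch 0.8-style directory names (crawldb, linkdb, segments,
# indexes) and argument order; verify against the usage output of bin/nutch.

# Hypothetical override so the commands can be dry-run, e.g. NUTCH=echo
NUTCH=${NUTCH:-bin/nutch}

post_fetch() {
  local crawl_dir=$1

  # Build the link database from every fetched segment.
  # ASSUMPTION: this release accepts "-dir <segments_dir>".
  "$NUTCH" invertlinks "$crawl_dir/linkdb" -dir "$crawl_dir/segments" || return 1

  # Index all segments, using the crawl db and the link db.
  "$NUTCH" index "$crawl_dir/indexes" "$crawl_dir/crawldb" \
    "$crawl_dir/linkdb" "$crawl_dir"/segments/* || return 1
}
```

The idea would be to call post_fetch $crawl_dir once, after the loop has
finished, rather than once per segment.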
Stefan Neufeind wrote:
>You're missing the step that actually indexes the pages :-) This is done
>inside the "crawl" command, which does everything in one go. After you
>have fetched everything, use:
>
>nutch invertlinks ...
>nutch index ...
>
>Hope that helps. Otherwise let me know and I'll dig out the complete
>commandlines for you.
>
>
>Regards,
> Stefan
>
>Matthew Holt wrote:
>
>
>>Just FYI: after I do the recrawl, I stop and start Tomcat, and still
>>the newly created page cannot be found.
>>
>>Matthew Holt wrote:
>>
>>
>>
>>>The recrawl worked this time, and I recrawled the entire db using the
>>>-adddays argument (in my case ./recrawl crawl 10 31). However, it
>>>didn't find a newly created page.
>>>
>>>If I delete the database and do the initial crawl over again, the new
>>>page is found. Any idea what I'm doing wrong or why it isn't finding it?
>>>
>>>Thanks!
>>>Matt
>>>
>>>Matthew Holt wrote:
>>>
>>>
>>>
>>>>Stefan,
>>>> Thanks a bunch! I see what you mean..
>>>>matt
>>>>
>>>>Stefan Neufeind wrote:
>>>>
>>>>
>>>>
>>>>>Matthew Holt wrote:
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>>Hi all,
>>>>>> I have already successfully indexed all the files on my domain only
>>>>>>(as specified in the conf/crawl-urlfilter.txt file).
>>>>>>
>>>>>>Now when I use the below script (./recrawl crawl 10 31) to recrawl the
>>>>>>domain, it begins indexing pages off of my domain (such as wikipedia,
>>>>>>etc). How do I prevent this? Thanks!
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>Hi Matt,
>>>>>
>>>>>have a look at regex-urlfilter. "crawl" is special in some ways.
>>>>>Actually it's a "shortcut" for several steps, and it has a special
>>>>>urlfilter file. But if you do it in several steps, that urlfilter
>>>>>file is no longer used.
>>>>>
>>>>>
>
>
>