nutch-dev mailing list archives

From Alex McLintock <alex.mclint...@gmail.com>
Subject Re: Why isn't this working?
Date Tue, 11 Aug 2009 09:35:44 GMT
I've been wondering about this problem. When you ran the invertlinks
and index steps, did you run them on just the most recent segment or
on all of the segments?

Presumably this is why you tried to do a merge?
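
To make the distinction concrete, here is a rough Python sketch that
builds the two commands either for the newest segment only or for every
segment. It assumes the tutorial-style layout (crawl/crawldb,
crawl/linkdb, crawl/segments, crawl/indexes) and the Nutch 1.0 argument
order "invertlinks <linkdb> <segment> ..." and
"index <index> <crawldb> <linkdb> <segment> ...". The helper names are
made up for illustration, so check everything against your own setup.

```python
from pathlib import Path
from typing import List


def list_segments(segments_dir: Path) -> List[Path]:
    """Return segment directories sorted by name.

    Nutch names segments by timestamp, so name order is oldest first.
    """
    return sorted(p for p in segments_dir.iterdir() if p.is_dir())


def build_commands(crawl_dir: Path, all_segments: bool) -> List[List[str]]:
    """Build the invertlinks and index command lines (hypothetical helper).

    With all_segments=False only the newest segment is passed, which would
    produce an index covering just the most recently fetched pages.
    """
    segments = list_segments(crawl_dir / "segments")
    if not segments:
        raise ValueError(f"no segments found under {crawl_dir / 'segments'}")

    chosen = segments if all_segments else segments[-1:]
    seg_args = [str(s) for s in chosen]
    linkdb = str(crawl_dir / "linkdb")

    # Assumed usage: invertlinks <linkdb> <segment> ...
    invert = ["bin/nutch", "invertlinks", linkdb] + seg_args
    # Assumed usage: index <index> <crawldb> <linkdb> <segment> ...
    index = [
        "bin/nutch",
        "index",
        str(crawl_dir / "indexes"),
        str(crawl_dir / "crawldb"),
        linkdb,
    ] + seg_args
    return [invert, index]
```

Calling build_commands(Path("crawl"), all_segments=True) and printing
each list joined with spaces shows what would be run, without running
anything.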

Alex

2009/8/10 Paul Tomblin <ptomblin@xcski.com>:
> After applying the patch I sent earlier, I got it so that it correctly
> skips downloading pages that haven't changed.  And after doing the
> generate/fetch/updatedb loop, and merging the segments with mergesegs,
> dumping the segment file seems to show that it still has the old
> content as well as the new content.  But when I then ran the
> invertlinks and index steps, the resulting index consists of very small
> files compared to the files from the previous crawl, indicating that
> it only indexed the stuff that it had newly fetched.  I tried the
> NutchBean, and sure enough it could only find things I knew were on
> the newly loaded pages, and couldn't find things that occur hundreds
> of times on the pages that haven't changed.  "merge" doesn't seem to
> help, since the resulting merged index is still the same size as
> before merging.
