nutch-dev mailing list archives

From Paul Tomblin <ptomb...@xcski.com>
Subject Re: Why isn't this working?
Date Tue, 11 Aug 2009 11:58:07 GMT
I followed the script (with minor variations) in the wiki at
http://wiki.apache.org/nutch/Crawl. However, I think I found another
bug. Apply this patch and it will index pages with a status of
STATUS_FETCH_NOTMODIFIED as well as STATUS_FETCH_SUCCESS.

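The intended effect of the IndexerMapReduce change can be sketched outside Nutch like this; note the byte values below are illustrative stand-ins, not the real constants from org.apache.nutch.crawl.CrawlDatum:

```java
public class StatusCheckDemo {
    // Illustrative stand-in values only; the real constants live in
    // org.apache.nutch.crawl.CrawlDatum.
    static final byte STATUS_FETCH_SUCCESS = 33;
    static final byte STATUS_FETCH_NOTMODIFIED = 36;
    static final byte STATUS_FETCH_GONE = 37;

    // After the patch: a page is indexable when the parse succeeded and the
    // fetch either succeeded or came back 304 Not Modified (the previously
    // stored content is still valid, so it should still be indexed).
    static boolean indexable(boolean parseSuccess, byte fetchStatus) {
        return parseSuccess
            && (fetchStatus == STATUS_FETCH_SUCCESS
                || fetchStatus == STATUS_FETCH_NOTMODIFIED);
    }

    public static void main(String[] args) {
        System.out.println(indexable(true, STATUS_FETCH_NOTMODIFIED)); // true after the patch
        System.out.println(indexable(true, STATUS_FETCH_GONE));        // false
    }
}
```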
Index: src/java/org/apache/nutch/indexer/IndexerMapReduce.java
===================================================================
--- src/java/org/apache/nutch/indexer/IndexerMapReduce.java	(revision 802632)
+++ src/java/org/apache/nutch/indexer/IndexerMapReduce.java	(working copy)
@@ -84,8 +84,10 @@
         if (CrawlDatum.hasDbStatus(datum))
           dbDatum = datum;
         else if (CrawlDatum.hasFetchStatus(datum)) {
-          // don't index unmodified (empty) pages
-          if (datum.getStatus() != CrawlDatum.STATUS_FETCH_NOTMODIFIED)
+          /*
+           * Where did this person get the idea that unmodified pages are empty?
+           // don't index unmodified (empty) pages
+          if (datum.getStatus() != CrawlDatum.STATUS_FETCH_NOTMODIFIED) */
             fetchDatum = datum;
         } else if (CrawlDatum.STATUS_LINKED == datum.getStatus() ||
                    CrawlDatum.STATUS_SIGNATURE == datum.getStatus()) {
@@ -108,7 +110,7 @@
     }

     if (!parseData.getStatus().isSuccess() ||
-        fetchDatum.getStatus() != CrawlDatum.STATUS_FETCH_SUCCESS) {
+        (fetchDatum.getStatus() != CrawlDatum.STATUS_FETCH_SUCCESS && fetchDatum.getStatus() != CrawlDatum.STATUS_FETCH_NOTMODIFIED)) {
       return;
     }

Index: src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java
===================================================================
--- src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java	(revision 802632)
+++ src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java	(working copy)
@@ -124,11 +124,15 @@
         reqStr.append("\r\n");
       }

-      reqStr.append("\r\n");
       if (datum.getModifiedTime() > 0) {
          reqStr.append("If-Modified-Since: " + HttpDateFormat.toString(datum.getModifiedTime()));
         reqStr.append("\r\n");
       }
+      else if (datum.getFetchTime() > 0) {
+          reqStr.append("If-Modified-Since: " + HttpDateFormat.toString(datum.getFetchTime()));
+          reqStr.append("\r\n");
+      }
+      reqStr.append("\r\n");

       byte[] reqBytes= reqStr.toString().getBytes();


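For reference, an If-Modified-Since header carries an RFC 1123 date in GMT, which is what Nutch's HttpDateFormat produces. A minimal sketch of the fallback logic in the second hunk, using only java.time (this class is hypothetical, not part of Nutch):

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

public class IfModifiedSinceDemo {
    // HTTP dates are RFC 1123, always expressed in GMT.
    static String httpDate(long epochMillis) {
        return DateTimeFormatter.RFC_1123_DATE_TIME
                .withLocale(Locale.US)
                .format(Instant.ofEpochMilli(epochMillis).atZone(ZoneOffset.UTC));
    }

    // Mirrors the patched branch: prefer the stored modification time, and
    // fall back to the last fetch time so the server can still answer 304.
    static String ifModifiedSince(long modifiedTime, long fetchTime) {
        long t = modifiedTime > 0 ? modifiedTime : fetchTime;
        return t > 0 ? "If-Modified-Since: " + httpDate(t) : null;
    }

    public static void main(String[] args) {
        // No modified time recorded, so the fetch time is used.
        System.out.println(ifModifiedSince(0L, 1249992000000L));
        // → If-Modified-Since: Tue, 11 Aug 2009 12:00:00 GMT
    }
}
```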

On Tue, Aug 11, 2009 at 5:35 AM, Alex McLintock<alex.mclintock@gmail.com> wrote:
> I've been wondering about this problem. When you did the invertlinks
> and index steps did you do it just on the current/most recent segment
> or all the segments?
>
> Presumably this is why you tried to do a merge?
>
> Alex
>
> 2009/8/10 Paul Tomblin <ptomblin@xcski.com>:
>> After applying the patch I sent earlier, I got it so that it correctly
>> skips downloading pages that haven't changed.  And after doing the
>> generate/fetch/updatedb loop, and merging the segments with mergeseg,
>> dumping the segment file seems to show that it still has the old
>> content as well as the new content.  But when I then ran the
>> invertlinks and index step, the resulting index consists of very small
>> files compared to the files from the previous crawl, indicating that
>> it only indexed the stuff that it had newly fetched.  I tried the
>> NutchBean, and sure enough it could only find things I knew were on
>> the newly loaded pages, and couldn't find things that occur hundreds
>> of times on the pages that haven't changed.  "merge" doesn't seem to
>> help, since the resulting merged index is still the same size as
>> before merging.
>



-- 
http://www.linkedin.com/in/paultomblin
