nutch-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Nutch Wiki] Update of "FAQ" by ra
Date Fri, 10 Aug 2007 13:30:21 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The following page has been changed by ra:
http://wiki.apache.org/nutch/FAQ

------------------------------------------------------------------------------
  
  The crawl tool expects as its first parameter the folder name where the seeding urls file
is located so for example if your urls.txt is located in /nutch/seeds the crawl command would
look like: crawl seed -dir /user/nutchuser...
  
- ==== Some pages are not indexed but my regex file and everyhing else is okay - what is going
on? ====
+ ==== Some pages are not indexed but my regex file and everything else is okay - what is
going on? ====
  The crawl tool has a default limitation of 100 outlinks of one page that are being fetched.
- To overcome this limitation change the property to a higher value or simply -1.
+ To overcome this limitation change the property to a higher value or simply -1 (unlimited).
  
  file: conf/nutch-default.xml
+ 
  {{{
   <property>
     <name>db.max.outlinks.per.page</name>
@@ -415, +416 @@

   </property> 
  }}}
  see also: http://www.mail-archive.com/nutch-user@lucene.apache.org/msg08665.html
- 
+ (tested under nutch 0.9)
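  
  A minimal sketch of the override described above. It assumes the usual Nutch convention that site-specific settings go in conf/nutch-site.xml, which takes precedence over conf/nutch-default.xml; that file is not mentioned on this page. The property name and the -1 value come from the text above, and the comment wording is illustrative only.
  
  file: conf/nutch-site.xml (assumed location)
  {{{
   <property>
     <name>db.max.outlinks.per.page</name>
     <!-- -1 removes the default limit of 100 outlinks per page -->
     <value>-1</value>
   </property>
  }}}
  The <property> element would sit inside the file's existing <configuration> root element.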
  
  
  
