nutch-user mailing list archives

From "Vishal Shah" <>
Subject RE: adding new URLs to nutch index
Date Mon, 04 Sep 2006 12:23:07 GMT
Hi Dima,

  Which version of Nutch are you using? From 0.8 onwards, the name of
the urls file has to be urls.txt, and its parent dir has to be passed
to inject. For example, if your urls.txt is in a dir called NewUrls, then
your inject cmd would be:

bin/nutch inject crawl/crawldb NewUrls

Also, check your crawl-urlfilter.txt to make sure that these new URLs
won't be filtered.
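
One way to sanity-check the filter before re-running inject is a small script that mimics the '+regex' / '-regex' rule style of the URL filter files. This is only a rough sketch under some assumptions: the rules and the example.com domain below are placeholders (not the rules from this thread), the helper names are made up for illustration, and it assumes the first matching rule decides and that a URL matching no rule is rejected.

```python
import re

# Hypothetical rules in the '+regex' / '-regex' style of crawl-urlfilter.txt.
# They are placeholders for illustration, not the poster's actual rules.
RULES = r"""
# skip file:, ftp:, and mailto: urls
-^(file|ftp|mailto):
# accept hosts in example.com (placeholder domain)
+^http://([a-z0-9]*\.)*example\.com/
# skip everything else
-.
"""


def parse_rules(text):
    """Turn filter-file text into a list of (accept, compiled_regex) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments carry no rule
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Assumed semantics: the first rule whose regex is found in the URL wins."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False  # assumed: a URL that matches no rule is dropped


if __name__ == "__main__":
    rules = parse_rules(RULES)
    for url in ("http://www.example.com/page.html", "http://other.org/"):
        print(url, "accepted" if accepts(rules, url) else "filtered out")
```

To try it against a real setup, you would paste the contents of your own crawl-urlfilter.txt into RULES and list the newly added URLs in the loop at the bottom.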



-----Original Message-----
From: Dima Gritsenko [] 
Sent: Monday, September 04, 2006 3:36 PM
Subject: adding new URLs to nutch index


We are indexing DMOZ, and we want to add two other URLs for indexing, but
we seem to have a problem searching for those 2 newly added URLs (no results).
Here's what we do to add new URLs to the nutch index:
1) Created a dir /url with a "url" file that contains these two URLs:

2) Then the following command is run (it should be adding our extra URLs
to nutch DB/index)
    bin/nutch inject crawl/crawldb urls

3) Then start recrawl
    bin/recrawl /home/dima/workspace/hapool/ /usr/share/nutch-0.8/crawl/ 3 0

We are also using index-url-category plugin that ascribes URLs to
different categories for future filtered search:
Here's what we do:

Add patterns used in regex-urlfilter.txt

# accept anything else


Add patterns used in crawl-urlfilter.txt

# accept hosts in MY.DOMAIN.NAME

# skip everything else

Patterns used in index-url-category plugin file

# News*=news*=news

# Blogs

# Web

Thank you. 
