nutch-user mailing list archives

From "Vishal Shah" <vish...@rediff.co.in>
Subject RE: adding new URLs to nutch index
Date Mon, 04 Sep 2006 12:23:07 GMT
Hi Dima,

  Which version of Nutch are you using? From 0.8 onwards, the name of
the urls file has to be urls.txt, and its parent dir has to be passed
to inject. For example, if your urls.txt is in a dir called NewUrls,
then your inject cmd would be:

bin/nutch inject crawl/crawldb NewUrls
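
If it helps, here is a rough Python sketch of that layout: it writes the
seed URLs to NewUrls/urls.txt, one per line, and builds the matching
inject command without running it. The helper names write_seed_dir and
inject_command are made up for illustration and are not part of Nutch.

```python
from pathlib import Path
import tempfile


def write_seed_dir(seed_dir, urls, filename="urls.txt"):
    """Create seed_dir/filename with one URL per line; return the file path."""
    seed_dir = Path(seed_dir)
    seed_dir.mkdir(parents=True, exist_ok=True)
    seed_file = seed_dir / filename
    seed_file.write_text("".join(url + "\n" for url in urls))
    return seed_file


def inject_command(crawldb, seed_dir):
    """Build the inject command line: the directory is passed, not the file."""
    return ["bin/nutch", "inject", str(crawldb), str(seed_dir)]


if __name__ == "__main__":
    # Demo in a throwaway directory so nothing is left behind.
    with tempfile.TemporaryDirectory() as tmp:
        seed_file = write_seed_dir(
            Path(tmp) / "NewUrls",
            [
                "http://www.newsvine.com/_feeds/rss2/index",
                "http://www.technorati.com/blogs/",
            ],
        )
        print(seed_file.read_text(), end="")
    print(" ".join(inject_command("crawl/crawldb", "NewUrls")))
```

The point is only that inject gets the directory (NewUrls) as its second
argument, never the path of the file inside it.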

Also, check your crawl-urlfilter.txt to make sure that these new URLs
won't be filtered.
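
One way to sanity-check the filter offline is a small Python sketch like
the one below. It assumes the filter file is read top to bottom, that
the first pattern found anywhere in the URL decides ("+" accepts, "-"
rejects), and that a URL matching no rule is rejected. Python's regex
syntax is close to Java's for simple patterns like these but not
identical, so treat the result as a hint only. The function names are
made up for illustration.

```python
import re

# Rules copied from the crawl-urlfilter.txt excerpt in the original message.
RULES_TEXT = r"""
# accept hosts in MY.DOMAIN.NAME
+^http:\/\/www\.technorati\.com\/blogs.*
+.*rss.*

# skip everything else
-.
"""


def parse_rules(text):
    """Turn filter lines into (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def is_accepted(url, rules):
    """Assumed semantics: the first rule whose pattern is found wins."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False  # assumption: no matching rule means the URL is dropped


if __name__ == "__main__":
    rules = parse_rules(RULES_TEXT)
    for url in (
        "http://www.newsvine.com/_feeds/rss2/index",
        "http://www.technorati.com/blogs/",
        "http://www.example.com/",
    ):
        print(url, "accepted" if is_accepted(url, rules) else "rejected")
```

Under those assumptions both of your seed URLs pass these rules, but any
outlink that neither contains "rss" nor sits under
www.technorati.com/blogs would be rejected by the final "-." rule.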

Regards,

-vishal.

-----Original Message-----
From: Dima Gritsenko [mailto:dima@ekreative.com] 
Sent: Monday, September 04, 2006 3:36 PM
To: nutch-user@lucene.apache.org
Subject: adding new URLs to nutch index

Hi, 

We are indexing DMOZ, and we want to add two other URLs for indexing,
but we seem to have a problem searching those two newly added URLs (no
results returned).
Here's what we do to add new URLs to the nutch index:
1) Created a dir /url with a "url" file that contains these two URLs:
    http://www.newsvine.com/_feeds/rss2/index
    http://www.technorati.com/blogs/

2) Then the following command is run (it should be adding our extra URLs
to nutch DB/index)
    bin/nutch inject crawl/crawldb urls

3) Then start recrawl
    bin/recrawl /home/dima/workspace/hapool/ /usr/share/nutch-0.8/crawl/ 3 0
 
We are also using the index-url-category plugin, which assigns URLs to
different categories for filtered search later.
Here's what we do:

Patterns added to regex-urlfilter.txt:

# accept anything else
+^http:\/\/www\.technorati\.com\/blogs.*
+.*rss.*

-.

Patterns added to crawl-urlfilter.txt:

# accept hosts in MY.DOMAIN.NAME
+^http:\/\/www\.technorati\.com\/blogs.*
+.*rss.*


# skip everything else
-.


Patterns used in index-url-category plugin 

rules.properties file

# News
http://newsrss.bbc.co.uk/rss/*=news
http://www.newsvine.com/*=news
.*rss.*=news
.*\.xml=news

# Blogs
.*technorati\.com\/blogs.*=blogs

# Web
.*=web

Thank you. 
Dima. 



