nutch-user mailing list archives

From Asim Baig <a...@catsone.com>
Subject Only index word and pdf files on a site
Date Fri, 16 Feb 2007 04:20:05 GMT
I just installed Nutch and have been trying to understand the filters.

I need to crawl one fairly large site (http://www.largesite.com) that I 
know contains hundreds of MS Word and PDF files. All I want to do is 
index the .doc and .pdf files. I don't want to index the HTML pages 
containing links to these two document types. I don't want to index 
*anything* except .doc and .pdf files.
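
To make the requirement concrete, the rule could be sketched roughly as below. This is only an illustration of the selection criterion, not Nutch code: the function name `should_index` and the constant `WANTED_SUFFIXES` are hypothetical, and the sketch assumes the decision can be made from the URL's path alone (a file served without a .doc or .pdf extension would be missed).

```python
from urllib.parse import urlparse

# Hypothetical names for illustration only; nothing here is Nutch API.
WANTED_SUFFIXES = (".doc", ".pdf")


def should_index(url: str) -> bool:
    """Return True only if the URL's path ends in .doc or .pdf.

    The query string and fragment are ignored, and the comparison is
    case-insensitive, so "Report.PDF?download=1" still qualifies.
    """
    path = urlparse(url).path.lower()
    return path.endswith(WANTED_SUFFIXES)


if __name__ == "__main__":
    for candidate in (
        "http://www.largesite.com/docs/report.pdf",
        "http://www.largesite.com/docs/index.html",
    ):
        print(candidate, should_index(candidate))
```

Note that this rule only covers what gets indexed. The HTML pages would presumably still have to be fetched and parsed, since that is how a crawler discovers the links to the documents in the first place.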

What do I need to do? Where do I start?



-- 
Asim Baig
Cognizo Technologies, Inc.
10501 Wayzata Blvd., Suite 100
Minnetonka, MN 55305
p: (952) 417-0067 x101
f: (952) 417-0068
c: (612) 382-7474
e: asim@cognizo.com
