hadoop-common-dev mailing list archives

From "prem kumar" <prem.kuma...@gmail.com>
Subject nutch to search local filesystem
Date Mon, 29 Oct 2007 04:57:01 GMT
Hi All,
I tried setting up a local filesystem crawl with Nutch 0.9 but ran into problems.
The details are below:


------------------------------

CRAWL OUTPUT:

Found 1 items

/user/test/urls      <dir>
crawl started in: crawled
rootUrlDir = urls
threads = 10
depth = 3
topN = 5
Injector: starting
Injector: crawlDb: crawled/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawled/segments/20071026235539
Generator: filtering: false
Generator: topN: 5
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=0 - no more URLs to fetch.
No URLs to fetch - check your seed list and URL filters.
crawl finished: crawled
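One way to narrow down where the seeds are being dropped (a sketch, assuming a standard Nutch 0.9 layout with the `crawled` directory from the output above) is to inspect the crawldb after injection:

```shell
# If the injector accepted the seeds, the crawldb should report a
# non-zero "TOTAL urls" count; if it reports 0, the URLs were filtered
# or normalized away at injection time, before the generator ever ran.
bin/nutch readdb crawled/crawldb -stats
```

A zero count would point at the URL filters or normalizers rather than the generator step.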



urls/seed file:

file:///export/home/test/test/tmp
file:///export/home/test/test/search

conf/crawl-urlfilter.txt:

# The url filter file used by the crawl command.

# Better for intranet crawling.
# Be sure to change MY.DOMAIN.NAME to your domain name.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'.  The first matching pattern in the file
# determines whether a URL is included or ignored.  If no pattern
# matches, the URL is ignored.

## skip file:, ftp:, & mailto: urls
##-^(file|ftp|mailto):
# skip http:, ftp:, & mailto: urls
-^(http|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$


# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
#-.*(/.+?)/.*?\1/.*?\1/

# accept hosts in MY.DOMAIN.NAME
#+^http://([a-z0-9]*\.)*com/

# skip everything else for http
#-.*
# take everything else for file
+.*
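As a sanity check, the rules above can be replayed outside Nutch. The sketch below is not Nutch code; it just applies the same patterns in file order, first match wins, to see which way the seed URLs would go (Nutch's RegexURLFilter does a substring match, hence `re.search`):

```python
import re

# Replay of the crawl-urlfilter.txt rules above, in file order.
# A URL that matches no pattern is ignored, as the file's comments state.
RULES = [
    # skip http:, ftp:, & mailto: urls (the file: skip is commented out)
    ('-', re.compile(r'^(http|ftp|mailto):')),
    # skip image and other suffixes we can't yet parse
    ('-', re.compile(r'\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|'
                     r'zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|'
                     r'bmp|BMP)$')),
    # skip URLs containing probable query characters
    ('-', re.compile(r'[?*!@=]')),
    # take everything else
    ('+', re.compile(r'.*')),
]

def accepts(url):
    """Return True if the first matching rule is a '+' rule."""
    for sign, pattern in RULES:
        if pattern.search(url):
            return sign == '+'
    return False  # no pattern matched: URL is ignored

for url in ('file:///export/home/test/test/tmp',
            'file:///export/home/test/test/search'):
    print(url, '->', '+' if accepts(url) else '-')
```

Both seed URLs come out accepted under these rules, which suggests the filter file itself is not what is rejecting them.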



conf/nutch-site.xml:

<configuration>
<property>
  <name>plugin.folders</name>
  <value>/export/home/test/test/nutch/build/plugins</value>
  <description>Directories where nutch plugins are located.  Each
  element may be a relative or absolute path.  If absolute, it is used
  as is.  If relative, it is searched for on the classpath.</description>
</property>
<property>
  <name>plugin.includes</name>
  <value>protocol-file|urlfilter-regex|parse-(text|html|msword|pdf)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable
  protocol-httpclient, but be aware of possible intermittent problems with the
  underlying commons-httpclient library.
  </description>
</property>
</configuration>

Any hints on how to proceed further?
Prem
