nutch-user mailing list archives

From "Chris Mattmann" <chris.mattm...@jpl.nasa.gov>
Subject RE: How do I enable PDF/Word etc. parsing in nutch?
Date Mon, 02 May 2005 17:38:20 GMT
Hi Jason,

 Step 1: open <nutch_home>/nutch-default.xml and edit the plugin.includes
 property so that its value names the parse plugins you need (here, msword
 and pdf):

<property>
  <name>plugin.includes</name>

  <!-- enable your plugins here -->
  <value>protocol-(http|file)|urlfilter-regex|parse-(text|html|rss|msword|pdf)|index-basic|query-(basic|site|url)</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.  By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins.
  </description>
</property>
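
To see which plugin directories a given plugin.includes value selects, here is
a minimal Python sketch. It assumes the expression has to match the whole
directory name, and that Python's regex behaves like Nutch's Java regex for a
pattern this simple. The candidate names and the helper included_plugins are
illustrative only and are not part of Nutch.

```python
import re

# Value of plugin.includes from the property above, joined onto one line.
PLUGIN_INCLUDES = (
    r"protocol-(http|file)|urlfilter-regex|parse-(text|html|rss|msword|pdf)"
    r"|index-basic|query-(basic|site|url)"
)


def included_plugins(names, pattern=PLUGIN_INCLUDES):
    """Return the names that the pattern matches in full.

    Assumption: a plugin is included only when the expression matches its
    entire directory name, not just a prefix of it.
    """
    regex = re.compile(pattern)
    return [name for name in names if regex.fullmatch(name)]


# Hypothetical directory names; list your own plugin directory for real ones.
candidates = [
    "protocol-http",
    "protocol-ftp",
    "parse-pdf",
    "parse-msword",
    "parse-ext",
    "index-basic",
]

print(included_plugins(candidates))
```

Names such as protocol-ftp or parse-ext fall outside the expression, so under
the assumption above they would stay disabled until added to the value.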

 Step 2: make sure that the plugin is built:

  From the <nutch_home> directory, perform the following:
 
  # ensure that the core classes are built
  % ant compile-core

  # ensure that the plugins are built
  % ant compile-plugins

Note that the compile-plugins task assumes that your plugin's build info is
in <nutch_home>/src/plugin/build.xml. If you're building a new plugin,
you'll have to add its ant compile info there; just follow the examples of
the other plugins.
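
A quick way to check whether a plugin is already mentioned in that build file
is sketched below in Python. It assumes each plugin is referenced by an
element carrying a dir="<plugin-name>" attribute; that layout, the SAMPLE
fragment, and the helper plugins_referenced are assumptions made for
illustration, so compare against the real file before relying on it.

```python
import xml.etree.ElementTree as ET


def plugins_referenced(build_xml_text):
    """Collect every value of a dir="..." attribute in an Ant build file.

    Assumption: plugins are wired in through elements that name the plugin
    directory in a dir attribute.
    """
    root = ET.fromstring(build_xml_text)
    return {el.get("dir") for el in root.iter() if el.get("dir")}


# Hypothetical build file fragment, not copied from the Nutch sources.
SAMPLE = """
<project name="plugins" default="deploy">
  <target name="deploy">
    <ant dir="parse-pdf" target="deploy"/>
    <ant dir="parse-msword" target="deploy"/>
  </target>
</project>
"""

print("parse-pdf" in plugins_referenced(SAMPLE))
```

To run it against a real checkout, read the text of
<nutch_home>/src/plugin/build.xml and pass it to plugins_referenced in place
of SAMPLE.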

 Step 3: you're done.


Good luck.


Thanks,
  Chris



______________________________________________
Chris A. Mattmann
Chris.Mattmann@jpl.nasa.gov 
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group

_________________________________________________
Jet Propulsion Laboratory            Pasadena, CA
Office: 171-266B                        Mailstop:  171-246

_______________________________________________________

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.


> -----Original Message-----
> From: Jason Manfield [mailto:rarish911@yahoo.com]
> Sent: Monday, May 02, 2005 10:24 AM
> To: nutch-user@incubator.apache.org
> Subject: How do I enable PDF/Word etc. parsing in nutch?
> 
> 

