nutch-user mailing list archives
From "Naomi Dushay" <Na...@cs.cornell.edu>
Subject RE: How do I enable PDF/Word etc. parsing in nutch?
Date Mon, 02 May 2005 17:58:03 GMT
One thing:

Create a <nutch_home>/nutch-site.xml and put your changes there instead of modifying nutch-default.xml.
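
A minimal sketch of such an override file, assuming it sits alongside nutch-default.xml and uses the same root element. I believe that element is <nutch-conf> in releases of this era, but copy whatever your nutch-default.xml actually uses. Only the properties you want to change need to appear; everything else keeps its default.

```xml
<?xml version="1.0"?>
<!-- nutch-site.xml: site-specific overrides of nutch-default.xml. -->
<!-- Assumption: the root element matches the one in your nutch-default.xml. -->
<nutch-conf>
  <property>
    <name>plugin.includes</name>
    <!-- Keep the regular expression on one line; msword and pdf are added to parse-(...). -->
    <value>protocol-(http|file)|urlfilter-regex|parse-(text|html|rss|msword|pdf)|index-basic|query-(basic|site|url)</value>
  </property>
</nutch-conf>
```

Keeping local changes in this file leaves nutch-default.xml untouched, so an upgrade can replace the defaults without losing your settings.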

- Naomi

> -----Original Message-----
> From: Chris Mattmann [mailto:chris.mattmann@jpl.nasa.gov]
> Sent: Monday, May 02, 2005 1:38 PM
> To: nutch-user@incubator.apache.org
> Subject: RE: How do I enable PDF/Word etc. parsing in nutch?
> 
> Hi Jason,
> 
>  Step 1: open <nutch_home>/nutch-default.xml and edit the following property:
> 
> <property>
>   <name>plugin.includes</name>
> 
> <!-- enable your plugins here -->
> 
> <value>protocol-(http|file)|urlfilter-regex|parse-(text|html|rss|msword|pdf)|index-basic|query-(basic|site|url)</value>
>   <description>Regular expression naming plugin directory names to
>   include.  Any plugin not matching this expression is excluded.  By
>   default Nutch includes crawling just HTML and plain text via HTTP,
>   and basic indexing and search plugins.
>   </description>
> </property>
> 
>  Step 2: make sure that the plugin is built:
> 
>   From the <nutch_home> directory, perform the following:
> 
>   # ensure that the core classes are built
>   % ant compile-core
> 
>   # ensure that the plugins are built
>   % ant compile-plugins
> 
> Note that the compile-plugins task assumes that your plugin's build info is
> in <nutch_home>/src/plugin/build.xml. If you're building a new plugin,
> you'll have to add its ant compile info there; just follow the examples of
> the other plugins.
> 
>  Step 3: you're done.
> 
> 
> Good luck.
> 
> 
> Thanks,
>   Chris
> 
> 
> 
> ______________________________________________
> Chris A. Mattmann
> Chris.Mattmann@jpl.nasa.gov
> Staff Member
> Modeling and Data Management Systems Section (387)
> Data Management Systems and Technologies Group
> 
> _________________________________________________
> Jet Propulsion Laboratory            Pasadena, CA
> Office: 171-266B                        Mailstop:  171-246
> 
> _______________________________________________________
> 
> Disclaimer:  The opinions presented within are my own and do not reflect
> those of either NASA, JPL, or the California Institute of Technology.
> 
> 
> > -----Original Message-----
> > From: Jason Manfield [mailto:rarish911@yahoo.com]
> > Sent: Monday, May 02, 2005 10:24 AM
> > To: nutch-user@incubator.apache.org
> > Subject: How do I enable PDF/Word etc. parsing in nutch?
> >