manifoldcf-user mailing list archives

From Erlend Garåsen <e.f.gara...@usit.uio.no>
Subject Filtering out unwanted content from HTML pages
Date Thu, 31 Mar 2011 14:41:02 GMT

All major commercial search engines ship with a web crawler that lets you 
filter out unwanted content, such as certain HTML blocks, comments, etc. 
Would it be advisable to add such functionality to MCF? Or would it be 
difficult to implement, since the idea behind the ExtractingRequestHandler 
is to send binary files to Solr?

Say that you have an HTML document which includes the following comments:
<!-- stop indexing -->
<!-- start indexing -->
All content between these two comments should then be excluded from the index.
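
To make the idea concrete, here is a minimal sketch of such a filter in plain Java. It is not part of MCF, Nutch, or Solr; the class name IndexingCommentFilter and the method stripSkippedBlocks are hypothetical, and it assumes the page is available as a string before it is handed on for indexing. It also assumes that a stop marker with no matching start marker means "skip to the end of the page".

```java
import java.util.regex.Pattern;

// Hypothetical helper, not an existing MCF/Nutch/Solr class.
final class IndexingCommentFilter {

    // Matches a "stop indexing" comment and everything up to the nearest
    // "start indexing" comment, or up to the end of input if none follows.
    // DOTALL lets '.' cross line breaks; the reluctant '.*?' keeps each
    // stop marker paired with the closest start marker.
    private static final Pattern SKIPPED = Pattern.compile(
        "<!--\\s*stop indexing\\s*-->.*?(?:<!--\\s*start indexing\\s*-->|\\z)",
        Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Returns the HTML with all marked blocks (markers included) removed.
    static String stripSkippedBlocks(String html) {
        return SKIPPED.matcher(html).replaceAll("");
    }
}
```

A regular expression is enough for fixed marker comments like these; a real crawler would more likely apply the same rule while walking the parsed HTML.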

I managed to modify Apache Nutch to add this functionality a few months 
ago.

Erlend

-- 
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
