manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Filtering out unwanted content from HTML pages
Date Thu, 31 Mar 2011 14:56:03 GMT
This is a good question.  I think we should carry this conversation
forward on connectors-dev.

My initial thought on this issue is that the functionality really
belongs in Tika.  Tika is set up to extract and filter in exactly this
way.  The only reason you'd want to do it in MCF is if it would change
the links you might extract (or, skip), and that seems to me less
interesting.  How do you feel about it?

Karl

On Thu, Mar 31, 2011 at 10:41 AM, Erlend Garåsen
<e.f.garasen@usit.uio.no> wrote:
>
> All major commercial search engines ship with a web crawler which
> allows one to filter out unwanted content, such as certain HTML blocks,
> comments, etc. Would it be advisable to add such functionality to MCF? Or
> would it be difficult to implement, since the idea behind the
> ExtractingRequestHandler is to send binary files to Solr?
>
> Say that you have an HTML document which includes the following comments:
> <!-- stop indexing -->
> <!-- start indexing -->
> All content between these comments should then be excluded from the index.
>
> I managed to modify Apache Nutch to add this functionality
> some months ago.
>
> Erlend
>
> --
> Erlend Garåsen
> Center for Information Technology Services
> University of Oslo
> P.O. Box 1086 Blindern, N-0317 OSLO, Norway
> Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
>
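The comment-based exclusion Erlend describes could be sketched as a simple pre-filter applied before the document is handed to Tika or Solr. This is a minimal illustration, not MCF or Tika code; the marker strings are taken from the example in the message above:

```python
import re

# Match everything between a "stop indexing" and a "start indexing"
# HTML comment (marker text taken from the example in the message above).
EXCLUDED_BLOCK = re.compile(
    r"<!--\s*stop indexing\s*-->.*?<!--\s*start indexing\s*-->",
    re.DOTALL | re.IGNORECASE,
)

def strip_excluded(html: str) -> str:
    """Return the HTML with all marked blocks removed before indexing."""
    return EXCLUDED_BLOCK.sub("", html)

html = ("<p>keep this</p>"
        "<!-- stop indexing --><div>navigation, ads</div><!-- start indexing -->"
        "<p>keep this too</p>")
print(strip_excluded(html))  # -> <p>keep this</p><p>keep this too</p>
```

A regex pre-filter like this works on the raw bytes before extraction, which matches Karl's point that link extraction is the only reason to do the filtering inside MCF itself; content-level filtering fits more naturally at the Tika/extraction stage.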
