manifoldcf-dev mailing list archives

From "Karl Wright (JIRA)" <>
Subject [jira] [Commented] (CONNECTORS-1193) Consider adding feature to web connector to skip pages that match specified criteria
Date Wed, 06 May 2015 11:08:00 GMT


Karl Wright commented on CONNECTORS-1193:

bq. This could be inserted at any stage in the pipeline meaning that if this filter is inserted
after html->text

That only makes sense if you are not adding the feature to the web connector after all, but
rather to a general content filter transformation connector.  A general content filter transformation
connector could indeed work in multiple ways, with the performance loss already noted,
but it appears to me that this functionality is primarily applicable to web crawling.
Even the RSS connector does not seem to require this kind of filtering.

If you accept this reasoning and want to implement this functionality in the web connector itself,
then we would probably make it work much like the content match feature for session authentication,
which parses the HTML tags and matches only the actual content.  There would therefore be no
ability to filter binary documents based on their contents, or to filter documents based on
their tag structure.
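
To illustrate the kind of matching described above, here is a minimal sketch. It is not ManifoldCF code and is written in Python only for brevity; the names `should_skip` and `_TextExtractor` and the sample patterns are hypothetical. It assumes the filter sees only the text left after HTML tags are parsed away, which is why tag structure cannot take part in the match.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes only; tags, attributes, script and style are ignored."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join chunks with a space and collapse whitespace. This can split a
        # word that spans inline tags; acceptable for a sketch.
        return " ".join(" ".join(self._chunks).split())


def should_skip(html, skip_patterns):
    """Return True if the visible text of `html` matches any compiled pattern."""
    extractor = _TextExtractor()
    extractor.feed(html)
    extractor.close()
    text = extractor.text()
    return any(p.search(text) for p in skip_patterns)


if __name__ == "__main__":
    patterns = [re.compile(r"page not found", re.IGNORECASE)]
    # A "soft 404": the server answered 200 but the body is an error message.
    print(should_skip("<h1>Page <b>Not</b> Found</h1>", patterns))
    # The phrase appears only in an attribute, so it is not matched.
    print(should_skip('<div class="page-not-found">Welcome</div>', patterns))
```

Because only text nodes are collected, a pattern can never match on element names or attributes, and binary documents would need a different extraction step before any such match could apply.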

> Consider adding feature to web connector to skip pages that match specified criteria
> ------------------------------------------------------------------------------------
>                 Key: CONNECTORS-1193
>                 URL:
>             Project: ManifoldCF
>          Issue Type: Improvement
>          Components: Web connector
>    Affects Versions: ManifoldCF 1.10, ManifoldCF 2.2
>            Reporter: Karl Wright
>            Assignee: Karl Wright
>             Fix For: ManifoldCF 1.10, ManifoldCF 2.2
> The user wants to skip content that matches specified criteria, because some sites don't
> return a 404 code (for instance) but instead return 200 with a textual error message.

This message was sent by Atlassian JIRA
