manifoldcf-dev mailing list archives

From "Arcadius Ahouansou (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CONNECTORS-1193) Consider adding feature to web connector to skip pages that match specified criteria
Date Wed, 06 May 2015 03:00:59 GMT

    [ https://issues.apache.org/jira/browse/CONNECTORS-1193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14529801#comment-14529801
] 

Arcadius Ahouansou commented on CONNECTORS-1193:
------------------------------------------------

Hello [~daddywri],

(1) A regex could work. However, I would personally prefer a dictionary. Let's see what others
think.
(2) Ideally, we should not impose any limitation on this. But to start, we could set a default
configurable max.
(3) Same for regex. We could set a default allowed max.
(4) This could be inserted at any stage in the pipeline, meaning that if this filter is inserted
after html->text, it will ignore tags; if inserted after Tika extracts text from a PDF/doc,
it will use the text content to make its decision, etc.
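
To make points (1) to (3) concrete, here is a minimal sketch in Python (not ManifoldCF code) of a
skip filter that accepts both a phrase dictionary and regexes, each with a configurable default
cap. The class name, the default values, and the reading of "max" as a cap on the number of
entries are all assumptions made for illustration.

```python
import re

# Hypothetical defaults; the thread only says "a default configurable max".
DEFAULT_MAX_PHRASES = 100
DEFAULT_MAX_PATTERNS = 20


class SkipCriteria:
    """Decides whether extracted text looks like a page that should be skipped."""

    def __init__(self, phrases=(), patterns=(),
                 max_phrases=DEFAULT_MAX_PHRASES,
                 max_patterns=DEFAULT_MAX_PATTERNS):
        phrases = list(phrases)
        patterns = list(patterns)
        # Enforce the configurable caps up front.
        if len(phrases) > max_phrases:
            raise ValueError("too many dictionary phrases")
        if len(patterns) > max_patterns:
            raise ValueError("too many regex patterns")
        # Dictionary entries are matched case-insensitively as substrings.
        self.phrases = [p.lower() for p in phrases]
        self.patterns = [re.compile(p, re.IGNORECASE) for p in patterns]

    def matches(self, text):
        """Return True if any phrase or regex is found in the text."""
        lowered = text.lower()
        if any(p in lowered for p in self.phrases):
            return True
        return any(rx.search(text) for rx in self.patterns)


criteria = SkipCriteria(phrases=["page not found"],
                        patterns=[r"error\s+\d{3}"])
print(criteria.matches("Sorry, Page Not Found"))
print(criteria.matches("Welcome to our site"))
```

Because the filter only sees a string, it works the same wherever it sits in the pipeline: fed
raw HTML it would also match inside tags, fed extracted text it would not.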

I understand that there are some concerns about memory usage when processing large documents,
and that this is the reason for preferring line-by-line processing.
From my understanding, Tika and the BoilerPipe extractor in MCF may already be opening
the full document stream for processing.
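
For comparison, line-by-line processing could look roughly like the Python sketch below, which
holds only one line in memory at a time and stops at the first match. The function name and the
optional max_lines cutoff are hypothetical and not part of MCF.

```python
import io


def stream_matches(lines, phrases, max_lines=None):
    """Scan an iterable of text lines and return True on the first phrase hit.

    Only the current line is kept in memory. max_lines (hypothetical) bounds
    how far into the document the scan goes.
    """
    lowered_phrases = [p.lower() for p in phrases]
    for index, line in enumerate(lines):
        if max_lines is not None and index >= max_lines:
            break
        lowered_line = line.lower()
        if any(p in lowered_line for p in lowered_phrases):
            return True
    return False


# A text stream iterates line by line, so the whole document is never loaded.
document = io.StringIO("<html>\n<body>Page Not Found</body>\n</html>\n")
print(stream_matches(document, ["page not found"]))
```

The trade-off is that a phrase broken across two lines is missed, which a whole-document match
would catch.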


> Consider adding feature to web connector to skip pages that match specified criteria
> ------------------------------------------------------------------------------------
>
>                 Key: CONNECTORS-1193
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1193
>             Project: ManifoldCF
>          Issue Type: Improvement
>          Components: Web connector
>    Affects Versions: ManifoldCF 1.10, ManifoldCF 2.2
>            Reporter: Karl Wright
>            Assignee: Karl Wright
>             Fix For: ManifoldCF 1.10, ManifoldCF 2.2
>
>
> The user wants to skip content that matches specified criteria, because some sites don't
> return a 404 code (for instance) but instead return 200 with a textual error message.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
