manifoldcf-dev mailing list archives

From "Karl Wright (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CONNECTORS-1193) Consider adding feature to web connector to skip pages that match specified criteria
Date Thu, 21 May 2015 06:26:59 GMT

    [ https://issues.apache.org/jira/browse/CONNECTORS-1193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14553705#comment-14553705
] 

Karl Wright commented on CONNECTORS-1193:
-----------------------------------------

Hi Arcadius,

(1) The UI portions of the patch look good to me.
(2) For the actual processing, it looks to me like you are loading the entire extracted content
for the page into memory.  That's never going to work.  Whether you use Tika to extract the
content or just fuzzyml, you will have to use various tricks to look for the content in a
stream rather than in a giant string.  There is other code already in the web connector that
does this; you might want to model your code on it.
(3) I think involving Tika or fuzzyml in every web fetch decision as a matter of course is
also a non-starter.  It would probably reduce the performance of the web connector by an order
of magnitude.  In general, I would greatly prefer that if the user has specified no content
to be excluded, then no extra parsing work happens.
(4) Using Tika and thus dealing with all kinds of binary content is probably also not going to
work, for performance reasons.  People crawl *very* large binary documents.  Web documents
are typically limited in size because they need to be displayed in a browser.  You could fix
this in one of two ways: either only look at html content with fuzzyml (which would cover
your initial use case completely), or you could limit the total characters on every document
to some maximum number you set as part of the document specification.  I don't think you've
made a compelling case for using Tika yet though.
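
To illustrate points (2) through (4), here is a minimal sketch of the general idea, not the
web connector's existing code.  The class name StreamingExclusionCheck, the method containsAny,
and the maxChars parameter are all hypothetical names made up for this example.  It assumes
the exclusion criteria are plain literal phrases and that the content is already available
as a character stream.  It does no work when no phrases are configured, reads in fixed-size
chunks, and stops after a configurable number of characters.

```java
import java.io.IOException;
import java.io.Reader;
import java.util.List;

// Hypothetical helper, not part of ManifoldCF.
final class StreamingExclusionCheck {

  private StreamingExclusionCheck() {
  }

  /**
   * Returns true if any phrase occurs within the first maxChars characters
   * of the stream. Memory use is bounded by the chunk size plus the length
   * of the longest phrase, never by the document size.
   */
  static boolean containsAny(Reader reader, List<String> phrases, long maxChars)
      throws IOException {
    // Nothing to exclude: return before reading a single character.
    if (phrases.isEmpty()) {
      return false;
    }
    int longest = 0;
    for (String phrase : phrases) {
      longest = Math.max(longest, phrase.length());
    }
    StringBuilder window = new StringBuilder();
    char[] chunk = new char[4096];
    long remaining = maxChars;
    while (remaining > 0) {
      int n = reader.read(chunk, 0, (int) Math.min(chunk.length, remaining));
      if (n < 0) {
        break; // end of stream
      }
      remaining -= n;
      window.append(chunk, 0, n);
      for (String phrase : phrases) {
        if (window.indexOf(phrase) >= 0) {
          return true;
        }
      }
      // Keep only the tail that could begin a match spanning two chunks.
      int keep = Math.min(window.length(), Math.max(longest - 1, 0));
      window.delete(0, window.length() - keep);
    }
    return false;
  }
}
```

A caller would wrap the fetched response in a Reader with the right character set and pass
the phrases and limit taken from the document specification.  Matching regular expressions
rather than literal phrases would need a different windowing strategy.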

As for integration testing, you have two possibilities.  The first is simply to count documents.
 That does not guarantee that the correct one(s) were excluded, but it's usually reasonable
to assume it if the cardinality is what you would expect.  The second is to get more detailed
by looking at the simple history report, which you can run via the Java API within your test.
 This should give you a precise idea of what was included and what was rejected.

Thanks!




> Consider adding feature to web connector to skip pages that match specified criteria
> ------------------------------------------------------------------------------------
>
>                 Key: CONNECTORS-1193
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1193
>             Project: ManifoldCF
>          Issue Type: Improvement
>          Components: Web connector
>    Affects Versions: ManifoldCF 1.10, ManifoldCF 2.2
>            Reporter: Karl Wright
>            Assignee: Karl Wright
>             Fix For: ManifoldCF 1.10, ManifoldCF 2.2
>
>         Attachments: CONNECTORS-1193.patch
>
>
> The user wants to skip content that matches specified criteria, because some sites don't
> return a 404 code (for instance) but instead return 200 with a textual error message.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
