manifoldcf-dev mailing list archives

From "Karl Wright (JIRA)" <j...@apache.org>
Subject [jira] [Assigned] (CONNECTORS-1547) No activity record for excluded documents in WebCrawlerConnector
Date Wed, 17 Oct 2018 14:37:00 GMT

     [ https://issues.apache.org/jira/browse/CONNECTORS-1547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karl Wright reassigned CONNECTORS-1547:
---------------------------------------

    Assignee: Karl Wright

> No activity record for excluded documents in WebCrawlerConnector
> ----------------------------------------------------------------
>
>                 Key: CONNECTORS-1547
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1547
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: Web connector
>            Reporter: Olivier Tavard
>            Assignee: Karl Wright
>            Priority: Minor
>         Attachments: manifoldcf_local_files.log, manifoldcf_web.log, simple_history_files.jpg, simple_history_web.jpg
>
>
> Hi,
> I noticed that no activity record is logged for documents excluded by the Document Filter transformation connector when crawling with the WebCrawler connector.
> To reproduce the issue on MCF out of the box:
> Null output connector 
> Web repository connector 
> Job :
> - DocumentFilter added which only accepts application/msword (doc/docx) documents
> The simple history does not mention the excluded documents (except for HTML documents). They have a fetch activity and nothing else (see simple_history_web.jpg).
> The excluded documents can only be seen in the MCF log (with DEBUG verbosity enabled for connectors):
> {code:java}
> Removing url 'https://www.datafari.com/assets/img/Logo_Datafari_4_Condensed_No_D_20180606_30x30.png' because it had the wrong content type ('image/png'){code}
> (see manifoldcf_local_files.log)
> The related code is in WebcrawlerConnector.java, line 904:
> {code:java}
> fetchStatus.contextMessage = "it had the wrong content type ('"+contentType+"')";
>  fetchStatus.resultSignal = RESULT_NO_DOCUMENT;
>  activityResultCode = null;{code}
> The activityResultCode is null, so no activity record is logged for the excluded document.
>  
>  
> If we configure the same job with a Local File system connector and the same Document Filter transformation connector, the simple history mentions all the excluded documents (see simple_history_files.jpg), and the code sets a specific error code and logs an activity record (class FileConnector, line 415):
> {code:java}
> if (!activities.checkMimeTypeIndexable(mimeType))
>  {
>  errorCode = activities.EXCLUDED_MIMETYPE;
>  errorDesc = "Excluded because mime type ('"+mimeType+"')";
>  Logging.connectors.debug("Skipping file '"+documentIdentifier+"' because mime type ('"+mimeType+"') was excluded by output connector.");
>  activities.noDocument(documentIdentifier,versionString);
>  continue;
>  }{code}
>  
> So I think the Web Crawler connector should behave the same way as FileConnector and explicitly record all the documents excluded by the user.
>  
> Best regards,
> Olivier
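
The behaviour requested in the report can be illustrated with a small standalone sketch. It is not ManifoldCF code: the class ExcludedDocumentSketch, its process method, the ActivityRecord shape, and the string value given to EXCLUDED_MIMETYPE are all hypothetical stand-ins, chosen only to show a non-null result code being recorded when a document is rejected for its content type.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical, simplified stand-in for a connector; NOT the real ManifoldCF API.
final class ExcludedDocumentSketch {

    // Named after the constant seen in the FileConnector snippet; the value is assumed.
    static final String EXCLUDED_MIMETYPE = "EXCLUDEDMIMETYPE";

    /** One entry in a pretend "simple history" (hypothetical shape). */
    static final class ActivityRecord {
        final String url;
        final String resultCode;
        final String description;

        ActivityRecord(String url, String resultCode, String description) {
            this.url = url;
            this.resultCode = resultCode;
            this.description = description;
        }
    }

    final List<ActivityRecord> history = new ArrayList<>();
    private final Set<String> acceptedTypes;

    ExcludedDocumentSketch(Set<String> acceptedTypes) {
        this.acceptedTypes = acceptedTypes;
    }

    /** Returns true if the document is kept, false if it is excluded. */
    boolean process(String url, String contentType) {
        if (!acceptedTypes.contains(contentType)) {
            // Record the exclusion with an explicit result code instead of null,
            // so the document shows up in the history.
            history.add(new ActivityRecord(url, EXCLUDED_MIMETYPE,
                "Excluded because mime type ('" + contentType + "')"));
            return false;
        }
        history.add(new ActivityRecord(url, "OK", null));
        return true;
    }

    public static void main(String[] args) {
        ExcludedDocumentSketch sketch =
            new ExcludedDocumentSketch(Set.of("application/msword"));
        sketch.process("https://example.com/logo.png", "image/png");
        sketch.process("https://example.com/report.doc", "application/msword");
        for (ActivityRecord r : sketch.history) {
            System.out.println(r.url + " " + r.resultCode + " " + r.description);
        }
    }
}
```

In this sketch every processed document, kept or excluded, leaves one history entry; the real fix would depend on how WebcrawlerConnector uses activityResultCode when it records the fetch activity.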



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
