manifoldcf-dev mailing list archives

From "Markus Schuch (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (CONNECTORS-1571) Web Crawler Connector checks different MIME type than it is sending down the pipeline
Date Mon, 14 Jan 2019 14:25:00 GMT

     [ https://issues.apache.org/jira/browse/CONNECTORS-1571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Schuch updated CONNECTORS-1571:
--------------------------------------
    Affects Version/s: ManifoldCF 2.10

> Web Crawler Connector checks different MIME type than it is sending down the pipeline
> -------------------------------------------------------------------------------------
>
>                 Key: CONNECTORS-1571
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1571
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: Web connector
>    Affects Versions: ManifoldCF 2.10
>            Reporter: Markus Schuch
>            Priority: Minor
>
> The Web Crawler Connector extracts the MIME type from the response Content-Type header.
> It then strips any {{charset=whatever_encoding}} parameter and asks the pipeline, via
> {{activities.checkMimeTypeIndexable(contentType);}}, whether the resulting MIME type
> (without the charset) should be ingested.
> When sending the actual {{RepositoryDocument}}, however, it sets the full MIME type (with the
> charset) in the document. This is not a major bug, but it is a small inconsistency: the
> HttpPoster of the Solr Output Connector performs a "hard" check of the MIME type again, which
> can have a different outcome than the preceding check activity.
> I think this was introduced, or rather revealed, by CONNECTORS-1482.
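
The inconsistency described above can be illustrated with a minimal, self-contained sketch. It is not the connector's actual code: the class `MimeTypeSketch` and the helper `baseMimeType` are hypothetical names, and lowercasing the result is an added assumption. The point is only that the value passed to the indexability check and the value stored in the document could be derived once and reused.

```java
import java.util.Locale;

// Hypothetical illustration; not taken from the ManifoldCF code base.
class MimeTypeSketch {

    /**
     * Returns the media type without parameters,
     * e.g. "text/html; charset=UTF-8" becomes "text/html".
     * Lowercasing is an assumption made for this sketch.
     */
    static String baseMimeType(String contentTypeHeader) {
        if (contentTypeHeader == null) {
            return null;
        }
        int semicolon = contentTypeHeader.indexOf(';');
        String base = semicolon >= 0
            ? contentTypeHeader.substring(0, semicolon)
            : contentTypeHeader;
        return base.trim().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String header = "text/html; charset=UTF-8";

        // Derive the parameter-free MIME type once ...
        String mimeType = baseMimeType(header);

        // ... and use this same value both for the indexability check
        // and for the MIME type set on the outgoing document, so that a
        // later check downstream sees what the earlier check approved.
        System.out.println(mimeType);
    }
}
```

Whether the connector should send the stripped or the full value is a design decision for the maintainers; the sketch only shows one way to keep the two values identical.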



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
