manifoldcf-dev mailing list archives

From "Karl Wright (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CONNECTORS-1482) Mime type exclusion and document length exclusion in Solr output connector don't apparently work
Date Wed, 10 Jan 2018 17:13:00 GMT

    [ https://issues.apache.org/jira/browse/CONNECTORS-1482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16320660#comment-16320660 ]

Karl Wright commented on CONNECTORS-1482:
-----------------------------------------

[~schuch], the *only* mime type that the Tika Extractor sets for a document is "text/plain".
If you want to filter documents based on their *original* mime type, you must apply that filter
*before* the Tika Extractor in your pipeline.
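
The ordering point above can be illustrated with a small, self-contained simulation. This is
only a sketch and does not use any ManifoldCF or Tika API: the names Doc, exclude_mime,
extract_text and run_pipeline are hypothetical, and extract_text merely stands in for an
extractor stage whose output is always labelled "text/plain".

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Optional, Set


@dataclass(frozen=True)
class Doc:
    """Hypothetical stand-in for a document moving through a pipeline."""
    name: str
    mime_type: str


# A stage returns the (possibly modified) document, or None to reject it.
Stage = Callable[[Doc], Optional[Doc]]


def exclude_mime(excluded: Set[str]) -> Stage:
    """Build a stage that rejects documents whose current mime type is excluded."""
    def stage(doc: Doc) -> Optional[Doc]:
        return None if doc.mime_type in excluded else doc
    return stage


def extract_text(doc: Doc) -> Optional[Doc]:
    """Assumed extractor behaviour: the output is always labelled text/plain."""
    return replace(doc, mime_type="text/plain")


def run_pipeline(doc: Doc, stages: List[Stage]) -> Optional[Doc]:
    """Pass the document through each stage in order; stop if a stage rejects it."""
    current: Optional[Doc] = doc
    for stage in stages:
        current = stage(current)
        if current is None:
            return None
    return current


pdf = Doc("report.pdf", "application/pdf")
no_pdf = exclude_mime({"application/pdf"})

# Filter placed after extraction: it only ever sees text/plain, so the PDF passes.
filter_after = run_pipeline(pdf, [extract_text, no_pdf])

# Filter placed before extraction: it sees the original mime type and rejects the PDF.
filter_before = run_pipeline(pdf, [no_pdf, extract_text])
```

In this sketch the same exclusion rule rejects the PDF only when it runs ahead of the
extractor stage, which matches the behaviour described above.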

> Mime type exclusion and document length exclusion in Solr output connector don't apparently work
> ------------------------------------------------------------------------------------------------
>
>                 Key: CONNECTORS-1482
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1482
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: Lucene/SOLR connector
>    Affects Versions: ManifoldCF 2.9
>            Reporter: Karl Wright
>            Assignee: Karl Wright
>             Fix For: ManifoldCF 2.10
>
>         Attachments: problem_documents_connector.png, problem_documents_connector_solr.png, problem_documents_connector_solr_stream_size.png
>
>
> See attached images.  Setting exclusions apparently does not prevent documents with that
> mime type from being included.  This may be because of regexp characters etc but it needs
> to be researched and documented at least.  Also, the length limitation doesn't seem to be
> working either.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
