manifoldcf-dev mailing list archives

From "Karl Wright (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CONNECTORS-1482) Mime type exclusion and document length exclusion in Solr output connector don't apparently work
Date Tue, 09 Jan 2018 14:41:03 GMT

    [ https://issues.apache.org/jira/browse/CONNECTORS-1482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16318515#comment-16318515 ]

Karl Wright commented on CONNECTORS-1482:
-----------------------------------------

The mime type exclusion is done as follows:

{code}
  /** Detect if a mime type is indexable or not.  This method is used by participating repository connectors to pre-filter the number of
  * unusable documents that will be passed to this output connector.
  *@param outputDescription is the document's output version.
  *@param mimeType is the mime type of the document.
  *@return true if the mime type is indexable by this connector.
  */
  @Override
  public boolean checkMimeTypeIndexable(VersionContext outputDescription, String mimeType, IOutputCheckActivity activities)
    throws ManifoldCFException, ServiceInterruption
  {
    getSession();
    if (useExtractUpdateHandler)
    {
      if (includedMimeTypes != null && includedMimeTypes.get(mimeType) == null)
        return false;
      if (excludedMimeTypes != null && excludedMimeTypes.get(mimeType) != null)
        return false;
      return true;
    }
    return acceptableMimeTypes.contains(mimeType.toLowerCase(Locale.ROOT));
  }
{code}

Some things to note about this.  First, you can only exclude mime types if you are using the
extracting update handler; with the standard handler the include/exclude lists are never
consulted and only acceptableMimeTypes is checked.  This explains why the standard handler
doesn't do it.  Second, the include/exclude check is case sensitive, which is a problem in
my opinion.  That's easily fixed though.  Third, this method is used ONLY to tell the
upstream connector not to send the document, so the exclusion can be bypassed if the
upstream connector doesn't play along.  A hard check really ought to be added in HttpPoster.
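
The case-sensitivity fix could look roughly like the sketch below.  This is only an
illustration, not the actual connector code: MimeTypeFilter and isIndexable are hypothetical
names, and it assumes the configured include/exclude lists can be held as sets of strings
that are lowercased once at construction time.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper (not part of ManifoldCF): case-insensitive include/exclude check. */
final class MimeTypeFilter {
  private final Set<String> included; // null means no include list is configured
  private final Set<String> excluded; // null means no exclude list is configured

  MimeTypeFilter(Collection<String> included, Collection<String> excluded) {
    this.included = normalizeAll(included);
    this.excluded = normalizeAll(excluded);
  }

  /** Trim and lowercase with a fixed locale so lookups do not depend on the JVM default. */
  private static String normalize(String mimeType) {
    return mimeType.trim().toLowerCase(Locale.ROOT);
  }

  private static Set<String> normalizeAll(Collection<String> mimeTypes) {
    if (mimeTypes == null)
      return null;
    Set<String> result = new HashSet<>();
    for (String m : mimeTypes)
      result.add(normalize(m));
    return result;
  }

  /** Same include-then-exclude order as the method quoted above, but on normalized keys. */
  boolean isIndexable(String mimeType) {
    if (mimeType == null)
      return included == null; // an unknown type can never match an include list
    String key = normalize(mimeType);
    if (included != null && !included.contains(key))
      return false;
    if (excluded != null && excluded.contains(key))
      return false;
    return true;
  }
}
```

If something like this were shared, the same call could presumably be made both from
checkMimeTypeIndexable and from the posting path, so that the exclusion still holds when
the upstream connector does not pre-filter.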


> Mime type exclusion and document length exclusion in Solr output connector don't apparently work
> ------------------------------------------------------------------------------------------------
>
>                 Key: CONNECTORS-1482
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1482
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: Lucene/SOLR connector
>    Affects Versions: ManifoldCF 2.9
>            Reporter: Karl Wright
>            Assignee: Karl Wright
>             Fix For: ManifoldCF 2.10
>
>         Attachments: problem_documents_connector.png, problem_documents_connector_solr.png, problem_documents_connector_solr_stream_size.png
>
>
> See attached images.  Setting exclusions apparently does not prevent documents with that
> mime type from being included.  This may be because of regexp characters etc but it needs
> to be researched and documented at least.  Also, the length limitation doesn't seem to be
> working either.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
