manifoldcf-dev mailing list archives

From "Karl Wright (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CONNECTORS-1482) Mime type exclusion and document length exclusion in Solr output connector don't apparently work
Date Tue, 09 Jan 2018 14:33:00 GMT

    [ https://issues.apache.org/jira/browse/CONNECTORS-1482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16318497#comment-16318497 ]

Karl Wright commented on CONNECTORS-1482:
-----------------------------------------

The length exclusion code is trivial and hard to bypass, unless the HttpPoster object is created
incorrectly:

{code}
  /**
   * Post the input stream to ingest
   *
   * @param documentURI is the document's uri.
   * @param document is the document structure to ingest.
   * @param arguments are the configuration arguments to pass in the post.  Key is argument
   *        name, value is a list of the argument values.
   * @param authorityNameString is the name of the governing authority for this document's
   *        acls, or null if none.
   * @param activities is the activities object, so we can report what's happening.
   * @return true if the ingestion was successful, or false if the ingestion is illegal.
   * @throws ManifoldCFException
   * @throws ServiceInterruption
   */
  public boolean indexPost(String documentURI,
    RepositoryDocument document, Map<String,List<String>> arguments,
    String authorityNameString, IOutputAddActivity activities)
    throws ManifoldCFException, ServiceInterruption
  {
    if (Logging.ingest.isDebugEnabled())
      Logging.ingest.debug("indexPost(): '" + documentURI + "'");

    // If the document is too long, reject it.
    if (maxDocumentLength != null && document.getBinaryLength() > maxDocumentLength.longValue()){
      activities.recordActivity(null,SolrConnector.INGEST_ACTIVITY,null,documentURI,activities.EXCLUDED_LENGTH,
        "Solr connector rejected document due to its big size: ('"+document.getBinaryLength()+"')");
      return false;
    }
{code}
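
To make the bypass condition concrete, here is a minimal sketch of the same null-guarded check in isolation. The class `LengthCheckSketch` and its `accepts` method are hypothetical stand-ins, not ManifoldCF code; the sketch only assumes that a null `maxDocumentLength` means no limit was configured, as the null test in `indexPost` suggests.

```java
// Hypothetical stand-in for the length check quoted above; not ManifoldCF code.
final class LengthCheckSketch {
  // Null is assumed to mean "no limit configured", mirroring the null test in indexPost().
  private final Long maxDocumentLength;

  LengthCheckSketch(Long maxDocumentLength) {
    this.maxDocumentLength = maxDocumentLength;
  }

  /** Returns true when a document of the given length would pass the check. */
  boolean accepts(long binaryLength) {
    if (maxDocumentLength != null && binaryLength > maxDocumentLength.longValue()) {
      return false;
    }
    return true;
  }

  public static void main(String[] args) {
    LengthCheckSketch limited = new LengthCheckSketch(1000L);
    LengthCheckSketch unlimited = new LengthCheckSketch(null);

    System.out.println(limited.accepts(1000L));            // true: exactly at the limit
    System.out.println(limited.accepts(1001L));            // false: one byte over
    System.out.println(unlimited.accepts(Long.MAX_VALUE)); // true: no limit was set
  }
}
```

If the limit never reaches the object (here, a null passed to the constructor), every document passes regardless of size, which is consistent with the "created incorrectly" case mentioned above.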


> Mime type exclusion and document length exclusion in Solr output connector don't apparently work
> ------------------------------------------------------------------------------------------------
>
>                 Key: CONNECTORS-1482
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1482
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: Lucene/SOLR connector
>    Affects Versions: ManifoldCF 2.9
>            Reporter: Karl Wright
>            Assignee: Karl Wright
>             Fix For: ManifoldCF 2.10
>
>         Attachments: problem_documents_connector.png, problem_documents_connector_solr.png, problem_documents_connector_solr_stream_size.png
>
>
> See attached images.  Setting exclusions apparently does not prevent documents with that
> mime type from being included.  This may be because of regexp characters etc but it needs
> to be researched and documented at least.  Also, the length limitation doesn't seem to be
> working either.
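
The "regexp characters" suspicion in the description can be illustrated with a small sketch. It assumes, purely for illustration, that an exclusion entry is matched as a regular expression; the issue does not confirm how the connector actually interprets exclusions. `MimeExclusionSketch` and both of its methods are hypothetical names, and only `java.util.regex.Pattern` is a real API.

```java
import java.util.regex.Pattern;

// Hypothetical illustration only; not how the Solr connector is known to work.
final class MimeExclusionSketch {
  // Interprets the exclusion entry as a regular expression (assumption).
  static boolean excludedAsRegex(String exclusion, String mimeType) {
    return Pattern.matches(exclusion, mimeType);
  }

  // Interprets the exclusion entry as a literal string by quoting it first.
  static boolean excludedAsLiteral(String exclusion, String mimeType) {
    return Pattern.matches(Pattern.quote(exclusion), mimeType);
  }

  public static void main(String[] args) {
    String mime = "image/svg+xml";
    // As a regex, "g+" means "one or more g", so the literal '+' in the
    // MIME type is never matched and the exclusion silently fails.
    System.out.println(excludedAsRegex(mime, mime));   // false
    System.out.println(excludedAsLiteral(mime, mime)); // true
  }
}
```

Under that assumption, a MIME type containing `+` or `.` would need escaping (or quoting) before it excludes what the user expects.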



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
