manifoldcf-dev mailing list archives

From "Karl Wright (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CONNECTORS-1193) Consider adding feature to web connector to skip pages that match specified criteria
Date Mon, 25 May 2015 16:02:17 GMT

    [ https://issues.apache.org/jira/browse/CONNECTORS-1193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14558352#comment-14558352 ]

Karl Wright commented on CONNECTORS-1193:
-----------------------------------------

Hi Arcadius,

Definitely getting closer.

Two comments:

{quote}
+      //pack(versionBuffer,excludesContentIndex,'+'); Should this be added here?
{quote}

Answer: yes.

For the following methods:

{code}
+    public boolean isDocumentContentIndexable(String documentIdentifier) throws ManifoldCFException {
+      if (excludeContentIndexPatterns.isEmpty()) {
+        if (Logging.connectors.isDebugEnabled())
+          Logging.connectors.debug("WEB: no content exclusion rule supplied... returning");
+        return true;
+      }
+
+      for (Pattern p : excludeContentIndexPatterns) {
+        String content = findSpecifiedContent(documentIdentifier, p);
+        if (content != null) {
+          if (Logging.connectors.isDebugEnabled())
+            Logging.connectors.debug("WEB: Url '" + documentIdentifier + "' is not indexable because content exclude pattern '" + p.toString() + "' matched it");
+
+          return false;
+        }
+      }
+      return true;
+    }
+
+    protected String findSpecifiedContent(String currentURI, Pattern pattern) throws ManifoldCFException
+    {
+      if (pattern == null )
+        return null;
+
+      FindContentHandler handler = new FindContentHandler(currentURI,pattern);
+      handleHTML(currentURI, handler);
+      return handler.getTargetURI();
     }
{code}

... I think it would be better to write your own equivalent of FindContentHandler which takes
an array of patterns, not just one.  Otherwise handleHTML runs once per pattern for every
document, and, once again, it will be slow.
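
To illustrate the shape of such a handler, here is a minimal sketch. It is not ManifoldCF code:
the class name MultiPatternContentMatcher and the noteText callback are hypothetical, and it
assumes the HTML parser hands text to the handler in chunks. The point is only that all patterns
are checked during a single pass over the document, stopping at the first match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-in for a multi-pattern FindContentHandler; not ManifoldCF API.
final class MultiPatternContentMatcher {
  private final List<Pattern> patterns;
  // First pattern that matched any text chunk, or null if none has matched yet.
  private Pattern matched;

  MultiPatternContentMatcher(List<Pattern> patterns) {
    // Defensive copy so later changes to the caller's list do not affect us.
    this.patterns = new ArrayList<>(patterns);
  }

  // Assumed callback: invoked once per text chunk while the document is parsed.
  void noteText(CharSequence text) {
    if (matched != null) {
      return; // already decided, skip the remaining chunks cheaply
    }
    for (Pattern p : patterns) {
      if (p.matcher(text).find()) {
        matched = p;
        return;
      }
    }
  }

  // True when no exclusion pattern matched any chunk seen so far.
  boolean isIndexable() {
    return matched == null;
  }

  // The pattern that caused the exclusion, useful for the debug log message.
  Pattern getMatchedPattern() {
    return matched;
  }
}
```

With something like this, isDocumentContentIndexable would build one handler from
excludeContentIndexPatterns, call handleHTML once, and then ask the handler whether anything
matched. Note that matching chunk by chunk will miss a pattern that spans two chunks, so the
real handler may need to buffer text depending on how the parser delivers it.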

Other than that, I think you are done.

> Consider adding feature to web connector to skip pages that match specified criteria
> ------------------------------------------------------------------------------------
>
>                 Key: CONNECTORS-1193
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1193
>             Project: ManifoldCF
>          Issue Type: Improvement
>          Components: Web connector
>    Affects Versions: ManifoldCF 1.10, ManifoldCF 2.2
>            Reporter: Karl Wright
>            Assignee: Karl Wright
>             Fix For: ManifoldCF 1.10, ManifoldCF 2.2
>
>         Attachments: CONNECTORS-1193.patch
>
>
> The user wants to skip content that matches specified criteria, because some sites don't return a 404 code (for instance) but instead return 200 with a textual error message.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
