manifoldcf-dev mailing list archives

From "Karl Wright (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CONNECTORS-1425) Add Tika server option to Tika connector
Date Thu, 11 May 2017 00:58:04 GMT

    [ https://issues.apache.org/jira/browse/CONNECTORS-1425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16005735#comment-16005735 ]

Karl Wright commented on CONNECTORS-1425:
-----------------------------------------

[~julienFL], I had to make a couple of changes to get this to build, but that was straightforward.

However, here are my other comments.

(1) There really should not be any thread waiting/retrying in the connector.  If there's an
error that should be retried, the proper thing to do is to throw a ServiceInterruption with
the appropriate parameters and let the retry happen that way:

{code}
            while (retry < 3 && response == null) {
              try {
                response = client.execute(tikaHost, httpPut);
                tikaServerDownException = null;
              } catch (IOException e) {
                tikaServerDownException = e;
                retry++;
                if (retry < 3) {
                  try {
                    Thread.sleep(sp.tikaRetry);
                  } catch (InterruptedException e1) {
                    // Should not happen
                  }
                }
              }
            }
{code}

The retry delay (sp.tikaRetry) can be used to compute the retry time for that ServiceInterruption,
and the limit of 3 attempts can also be supplied as a parameter.
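
A minimal sketch of that approach follows. It is not the patch itself, and it makes several assumptions so that it compiles without ManifoldCF on the classpath: ServiceInterruptionSketch is a hypothetical stand-in for org.apache.manifoldcf.agents.interfaces.ServiceInterruption (its parameter list mirrors what I believe the real constructor takes, so check the javadoc before relying on it), TikaCall is a hypothetical wrapper around client.execute(tikaHost, httpPut), and retryDelayMs plays the role of sp.tikaRetry.

```java
import java.io.IOException;

// Hypothetical stand-in for ManifoldCF's ServiceInterruption (assumption, not the real class).
class ServiceInterruptionSketch extends Exception {
  final long retryTime;      // earliest time, in ms since epoch, at which to retry
  final long failTime;       // hard deadline in ms since epoch; -1L means none
  final int failRetryCount;  // maximum number of retries before giving up
  final boolean abortOnFail; // whether exhausting the retries should abort the job

  ServiceInterruptionSketch(String message, Throwable cause, long retryTime,
                            long failTime, int failRetryCount, boolean abortOnFail) {
    super(message, cause);
    this.retryTime = retryTime;
    this.failTime = failTime;
    this.failRetryCount = failRetryCount;
    this.abortOnFail = abortOnFail;
  }
}

// Hypothetical wrapper for the single HTTP call to the Tika server.
interface TikaCall<T> {
  T execute() throws IOException;
}

class TikaRetrySketch {
  static final int MAX_RETRIES = 3; // the count hard-coded in the loop above

  // Makes exactly one attempt. On an IOException no thread sleeps here;
  // the retry delay and retry count are handed to the caller in the exception.
  static <T> T callOnce(TikaCall<T> call, long retryDelayMs, long now)
      throws ServiceInterruptionSketch {
    try {
      return call.execute();
    } catch (IOException e) {
      throw new ServiceInterruptionSketch(
          "Tika server unavailable: " + e.getMessage(), e,
          now + retryDelayMs, // when the framework should try again
          -1L,                // no hard deadline in this sketch
          MAX_RETRIES,
          false);             // assumption: do not abort the whole job
    }
  }
}
```

In the connector the call would look something like callOnce(() -> client.execute(tikaHost, httpPut), sp.tikaRetry, System.currentTimeMillis()), with the exception propagating out of the transformation method so the framework schedules the retry.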

(2) The new code is definitely not backwards compatible with the old code.  For example, this
mapping of punctuation to '_'s in metadata names when lowercase is specified is new:

{code}
               if (!Character.isLetterOrDigit(ch))
                 ch = '_';
               else
                 ch = Character.toLowerCase(ch);
{code}
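
To make the incompatibility concrete, here is a small self-contained comparison. It assumes the existing connector only lowercases metadata names when the lowercase option is set; that is an inference from the comment above, not something checked against the old source. MetadataNameSketch and both method names are hypothetical.

```java
import java.util.Locale;

class MetadataNameSketch {
  // Assumed pre-patch behaviour: lowercase only, punctuation left untouched.
  static String lowercaseOnly(String name) {
    return name.toLowerCase(Locale.ROOT);
  }

  // Behaviour of the patch snippet above: every character that is not a
  // letter or digit becomes '_', and everything else is lowercased.
  static String patched(String name) {
    StringBuilder sb = new StringBuilder(name.length());
    for (int i = 0; i < name.length(); i++) {
      char ch = name.charAt(i);
      if (!Character.isLetterOrDigit(ch))
        ch = '_';
      else
        ch = Character.toLowerCase(ch);
      sb.append(ch);
    }
    return sb.toString();
  }
}
```

Under that assumption, a name such as "Content-Type" would have been emitted as "content-type" before and becomes "content_type" with the patch, so any downstream field mapping keyed on the old name would stop matching.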

(3) When the remote service is called, the boilerpipe extractions are *not* permitted, and
will not take place.  This is actually a major deal because the functionality when the server
is used differs significantly from when it isn't used.

(4) Connection pooling with HttpClient is not done in such a way as to control the number
of connections and connection pools.  This logic should be taken from other connectors that
manage HttpClient connections so that there is a pool per transformation connection and the
pool size is 1.

I'm therefore considering a number of options here.  First, undoing the changes that break
backwards compatibility would seem to be important, as would fixing the pooling problems and
thread stall issues.  But to solve (3) I am wondering whether a better approach might be to have
two entirely separate Tika connectors: one internal, and the other external (but with fewer
features).  Let me ponder this and come up with a broader plan.


> Add Tika server option to Tika connector 
> -----------------------------------------
>
>                 Key: CONNECTORS-1425
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1425
>             Project: ManifoldCF
>          Issue Type: Improvement
>          Components: Tika extractor
>    Affects Versions: ManifoldCF 2.6
>            Reporter: Julien Massiera
>            Assignee: Karl Wright
>            Priority: Minor
>             Fix For: ManifoldCF 2.8
>
>         Attachments: CONNECTORS-1425.patch
>
>
> This modification allows users to switch between the embedded Tika parsers and a call to a
> Tika Server to extract the content and metadata of a document.
> These modifications are backward compatible with the current Tika connector.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
