manifoldcf-dev mailing list archives

From "Karl Wright (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CONNECTORS-1270) Import OpenNLP connector into trunk
Date Wed, 27 Jan 2016 06:10:39 GMT

    [ https://issues.apache.org/jira/browse/CONNECTORS-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15118714#comment-15118714 ]

Karl Wright commented on CONNECTORS-1270:
-----------------------------------------

Hmm, looking at the code, there's also this problem:

{code}
    byte[] bytes = IOUtils.toByteArray(document.getBinaryStream());
...
    // reset original stream
    docCopy.setBinary(new ByteArrayInputStream(bytes), bytes.length);
{code}

This is generally unacceptable for a production connector: the entire document is loaded
into memory here, which won't work because memory consumption has to be bounded.  Unfortunately,
looking at the OpenNLP SentenceDetector API, there is no support for streaming:

https://opennlp.apache.org/documentation/1.6.0/apidocs/opennlp-tools/opennlp/tools/sentdetect/SentenceDetector.html

About the only thing we can reasonably do is a "rolling buffer" approach: page in a chunk
of the document (e.g. 64K) and do sentence detection on it, then chuck the first 3/4 and
page in another 64K, so that sentence detection runs over the retained tail of the last
chunk together with the next chunk.  Sentences at the start of each subsequent chunk that
were already found in the overlap are detected and chucked.  I'm not worried about run-on
sentences here.
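
To make the idea concrete, here is a minimal sketch of a bounded-memory rolling window. It is not the connector's code, and it uses a simplified variant of the scheme above: instead of discarding a fixed 3/4 of the window, it carries over only the final, possibly truncated sentence. The names `RollingSentenceReader`, `Detector`, `naiveDetect`, `process` and `windowSize` are all hypothetical; a real connector would presumably adapt OpenNLP's span-returning `sentPosDetect` call behind the `Detector` interface.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

final class RollingSentenceReader {

  /** Hypothetical stand-in for a sentence detector: returns ordered {start, end} offsets. */
  interface Detector {
    List<int[]> detect(String text);
  }

  /** Naive demo detector (assumption, not OpenNLP): a sentence ends at '.', '!' or '?'. */
  static List<int[]> naiveDetect(String text) {
    List<int[]> spans = new ArrayList<>();
    int start = 0;
    for (int i = 0; i < text.length(); i++) {
      char c = text.charAt(i);
      if (c == '.' || c == '!' || c == '?') {
        addTrimmed(spans, text, start, i + 1);
        start = i + 1;
      }
    }
    addTrimmed(spans, text, start, text.length()); // trailing text without a terminator
    return spans;
  }

  private static void addTrimmed(List<int[]> spans, String text, int start, int end) {
    while (start < end && Character.isWhitespace(text.charAt(start))) start++;
    while (end > start && Character.isWhitespace(text.charAt(end - 1))) end--;
    if (start < end) spans.add(new int[] {start, end});
  }

  /** Streams sentences to sink while holding at most windowSize chars of the document. */
  static void process(Reader in, int windowSize, Detector detector, Consumer<String> sink) {
    char[] window = new char[windowSize];
    int filled = 0;
    boolean eof = false;
    while (true) {
      // Top up the window until it is full or the stream ends.
      while (!eof && filled < windowSize) {
        int n;
        try {
          n = in.read(window, filled, windowSize - filled);
        } catch (IOException e) {
          throw new UncheckedIOException(e);
        }
        if (n < 0) eof = true; else filled += n;
      }
      if (filled == 0) break;

      String text = new String(window, 0, filled);
      List<int[]> spans = detector.detect(text);
      int emit = spans.size();   // number of sentences safe to emit
      int consumed = filled;     // number of chars to drop from the window
      if (!eof && !spans.isEmpty()) {
        int lastStart = spans.get(spans.size() - 1)[0];
        if (lastStart > 0) {
          // The last sentence may be cut off by the window edge: keep it for the next pass.
          emit = spans.size() - 1;
          consumed = lastStart;
        }
        // lastStart == 0: one sentence fills the whole window (run-on); emit it as-is.
      }
      for (int i = 0; i < emit; i++) {
        sink.accept(text.substring(spans.get(i)[0], spans.get(i)[1]));
      }
      System.arraycopy(window, consumed, window, 0, filled - consumed);
      filled -= consumed;
    }
  }
}
```

A caller would pass a `Reader` over the document stream, a window size such as 64K, and a callback that records each sentence (or the entities found in it). A sentence longer than the window is simply split at the window boundary, which matches the stance on run-on sentences above.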

As for the need to replay the content stream, the best way to do that is to create a duplicate
of the original RepositoryDocument object using the standard MCF support for that, add the
new metadata, and close it off after it has been handed downstream.

Both proposed sets of changes seem critical to me, and neither is terribly easy.
I will tackle the actual flow change (so that documents are never loaded into memory in
full), but there is still a lot to do even when that's done.







> Import OpenNLP connector into trunk
> -----------------------------------
>
>                 Key: CONNECTORS-1270
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1270
>             Project: ManifoldCF
>          Issue Type: Task
>            Reporter: Karl Wright
>            Assignee: Rafa Haro
>             Fix For: ManifoldCF 2.4
>
>
> An OpenNLP connector has been contributed on github.  Need to import it into MCF, first to a branch, then to trunk.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
