manifoldcf-dev mailing list archives

From "Karl Wright (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CONNECTORS-1270) Import OpenNLP connector into trunk
Date Wed, 27 Jan 2016 11:45:40 GMT

    [ https://issues.apache.org/jira/browse/CONNECTORS-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15119102#comment-15119102 ]

Karl Wright commented on CONNECTORS-1270:
-----------------------------------------

[~rafaharo]: It sounds ridiculous to worry about large documents until you realize that some
people ingest millions of documents and would be very hard pressed to find the few among
them that contain hundreds of megabytes of text.  We really can't assume that it is safe to
load a document that large into memory.

The approach I took with the Tika Extractor was to load the document into memory if it was
less than a certain size; otherwise it goes to disk.  I've used the same strategy here.
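
For illustration, here is a minimal sketch of that size-threshold strategy. It is not the actual Tika Extractor code: the class name SpillBuffer, its methods, and the threshold parameter are all hypothetical, and content exactly at the threshold is assumed to stay in memory.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical holder: keeps small content in memory, spills larger content to a temp file. */
public class SpillBuffer implements AutoCloseable {
  private final byte[] inMemory; // non-null when the content fit under the threshold
  private final Path onDisk;     // non-null when the content was spilled to disk

  private SpillBuffer(byte[] inMemory, Path onDisk) {
    this.inMemory = inMemory;
    this.onDisk = onDisk;
  }

  /** Reads the whole stream, holding at most roughly maxInMemoryBytes plus one chunk in memory. */
  public static SpillBuffer read(InputStream in, int maxInMemoryBytes) throws IOException {
    ByteArrayOutputStream head = new ByteArrayOutputStream();
    byte[] chunk = new byte[8192];
    int n;
    while ((n = in.read(chunk)) != -1) {
      head.write(chunk, 0, n);
      if (head.size() > maxInMemoryBytes) {
        // Over the threshold: write what we have to a temp file, then stream the rest to it.
        Path tmp = Files.createTempFile("spill", ".bin");
        try (OutputStream out = Files.newOutputStream(tmp)) {
          head.writeTo(out);
          while ((n = in.read(chunk)) != -1) {
            out.write(chunk, 0, n);
          }
        }
        return new SpillBuffer(null, tmp);
      }
    }
    return new SpillBuffer(head.toByteArray(), null);
  }

  public boolean isInMemory() {
    return inMemory != null;
  }

  /** Opens a fresh stream over the content, wherever it is stored. */
  public InputStream open() throws IOException {
    return inMemory != null ? new ByteArrayInputStream(inMemory) : Files.newInputStream(onDisk);
  }

  /** Deletes the temp file, if one was created. */
  @Override
  public void close() throws IOException {
    if (onDisk != null) {
      Files.deleteIfExists(onDisk);
    }
  }
}
```

A caller would wrap this in try-with-resources so the temp file is removed once the document has been processed.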

As for the NLP processing, we have a choice: either (1) process only the first N characters,
or (2) process using a rolling buffer, and try to algorithmically remove any sentence fragments
that the buffer boundaries introduce.  Right now, I'm leaning towards (1).  I should
have that done by the end of the day.
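
To make the two options concrete, here is a rough sketch of each. The class and method names are hypothetical, and the sentence boundary test (splitting on '.', '!' and '?') is a naive stand-in for a real sentence detector. The rolling variant shown carries an unfinished fragment forward into the next window rather than discarding it, which is only one possible way of handling fragments. It assumes windowChars is greater than zero.

```java
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helpers illustrating the two options; not the connector's actual code. */
public class TextWindowing {

  /** Option (1): return at most maxChars characters from the reader. */
  public static String firstN(Reader in, int maxChars) throws IOException {
    StringBuilder sb = new StringBuilder();
    char[] buf = new char[4096];
    while (sb.length() < maxChars) {
      int n = in.read(buf, 0, Math.min(buf.length, maxChars - sb.length()));
      if (n == -1) {
        break; // end of input before reaching the limit
      }
      sb.append(buf, 0, n);
    }
    return sb.toString();
  }

  /** Option (2): read in windows, emit whole sentences, carry the trailing fragment forward. */
  public static List<String> rollingSentences(Reader in, int windowChars) throws IOException {
    List<String> sentences = new ArrayList<>();
    StringBuilder pending = new StringBuilder();
    char[] buf = new char[windowChars];
    int n;
    while ((n = in.read(buf, 0, buf.length)) != -1) {
      pending.append(buf, 0, n);
      int start = 0;
      for (int i = 0; i < pending.length(); i++) {
        char c = pending.charAt(i);
        if (c == '.' || c == '!' || c == '?') { // naive boundary test
          String s = pending.substring(start, i + 1).trim();
          if (!s.isEmpty()) {
            sentences.add(s);
          }
          start = i + 1;
        }
      }
      pending.delete(0, start); // keep only the unfinished fragment
    }
    String tail = pending.toString().trim();
    if (!tail.isEmpty()) {
      sentences.add(tail); // whatever remains at end of input
    }
    return sentences;
  }
}
```

Note that the rolling variant still grows without bound if the text contains no sentence terminators at all, so a real implementation would need a cap on the pending fragment. Option (1) has no such issue, which is part of its appeal.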



> Import OpenNLP connector into trunk
> -----------------------------------
>
>                 Key: CONNECTORS-1270
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1270
>             Project: ManifoldCF
>          Issue Type: Task
>            Reporter: Karl Wright
>            Assignee: Rafa Haro
>             Fix For: ManifoldCF 2.4
>
>
> An OpenNLP connector has been contributed on GitHub.  Need to import it into MCF, first
> to a branch, then to trunk.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
