lucene-dev mailing list archives

From "Lance Norskog (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-2899) Add OpenNLP Analysis capabilities as a module
Date Wed, 06 Jun 2012 09:10:23 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13290040#comment-13290040 ]

Lance Norskog commented on LUCENE-2899:
---------------------------------------

A note on the OpenNLPUtil factory class: the statistical models are several megabytes
apiece. The class loads them and caches them by file name, and it does not reload them
across commits.

The models are immutable objects. For each field analysis, the factory class creates a
separate object that consults the cached model.
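
The caching scheme described above might be sketched roughly as follows. This is not the
code from the attached patch: Model, ModelUser, ModelCache and the loader function are
hypothetical stand-ins introduced only for illustration, and the real OpenNLPUtil class may
be organized quite differently.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical stand-in for a large, immutable statistical model. */
final class Model {
    private final String fileName;
    Model(String fileName) { this.fileName = fileName; }
    String fileName() { return fileName; }
}

/** Hypothetical lightweight object, one per field analysis, that consults a model. */
final class ModelUser {
    private final Model model;
    ModelUser(Model model) { this.model = model; }
    Model model() { return model; }
}

/** Sketch of a factory that loads each model once and caches it by file name. */
final class ModelCache {
    private final Map<String, Model> models = new ConcurrentHashMap<>();
    private final Function<String, Model> loader; // assumed: reads the model file

    ModelCache(Function<String, Model> loader) { this.loader = loader; }

    /** Returns the cached model, invoking the loader only on the first request. */
    Model getModel(String fileName) {
        return models.computeIfAbsent(fileName, loader);
    }

    /** Creates a new per-field object backed by the cached model. */
    ModelUser newUser(String fileName) {
        return new ModelUser(getModel(fileName));
    }
}
```

Because the model is immutable, any number of ModelUser objects can consult the same
instance safely, and only the small per-field objects are created repeatedly.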

The models are large enough that if the unit tests load them all at once, the test run
needs more than the default RAM. Therefore, the unit tests unload all models between tests
and run only single-threaded.
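
Unloading between tests could look something like the sketch below. TestModels, get,
clearModels and loadedCount are hypothetical names, and a byte array stands in for a loaded
model; the patch may expose a different method for this.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical static model registry; a byte[] stands in for a loaded model. */
final class TestModels {
    private static final Map<String, byte[]> CACHE = new HashMap<>();

    private TestModels() {}

    /** Loads (here: fakes) a model on first use and caches it by file name. */
    static synchronized byte[] get(String fileName) {
        return CACHE.computeIfAbsent(fileName, name -> new byte[1024]);
    }

    /** Drops every cached model so the memory can be garbage collected. */
    static synchronized void clearModels() {
        CACHE.clear();
    }

    /** Number of models currently held in the cache. */
    static synchronized int loadedCount() {
        return CACHE.size();
    }
}
```

In a JUnit 4 test class, a method annotated with @After could call
TestModels.clearModels() so that at most one test's models are resident at a time.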


                
> Add OpenNLP Analysis capabilities as a module
> ---------------------------------------------
>
>                 Key: LUCENE-2899
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2899
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: modules/analysis
>            Reporter: Grant Ingersoll
>            Priority: Minor
>         Attachments: opennlp_trunk.patch
>
>
> Now that OpenNLP is an ASF project and has a nice license, it would be nice to have a
> submodule (under analysis) that exposed capabilities for it. Drew Farris, Tom Morton and I
> have code that does:
> * Sentence Detection as a Tokenizer (could also be a TokenFilter, although it would have
>   to change slightly to buffer tokens)
> * NamedEntity recognition as a TokenFilter
> We are also planning a Tokenizer/TokenFilter that can put parts of speech as either payloads
> (PartOfSpeechAttribute?) on a token or at the same position.
> I'd propose it go under:
> modules/analysis/opennlp


