lucene-dev mailing list archives

From "Lance Norskog (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (LUCENE-2899) Add OpenNLP Analysis capabilities as a module
Date Mon, 02 Jul 2012 06:11:01 GMT

     [ https://issues.apache.org/jira/browse/LUCENE-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lance Norskog updated LUCENE-2899:
----------------------------------

    Attachment: LUCENE-2899.patch

This is about finished. The Tokenizer and TokenFilters are moved over into lucene/analysis/opennlp.
They do not have unit tests in lucene/ because of the difficulty in supplying model data.
They are unit-tested by the factories in solr/contrib/opennlp.

The solr/example/opennlp directory is gone, as per request. Possible field types are documented
in the solrconfig.xml in the unit test resources.

All jars are downloaded via ivy. The jwnl library is one rev after what this was compiled
with. It is only used in collocation, which is not exposed in this release.

To build, test and commit, there is a bootstrap sequence. In the top-level directory:
{code}
  ant clean compile
{code}
This downloads the OpenNLP jars. Then train the test models:
{code}
cd solr/contrib/opennlp/test-files/training
sh bin/training.sh
{code}
This creates low-quality model files in {{solr/contrib/opennlp/src/test-files/opennlp/solr/collection1/conf/opennlp}}.
In the trunk/solr directory, run:
{code}
ant example test-contrib
{code}
You now have committable binary models. They are small, and only there to run the OpenNLP
unit tests. They generate results that are objectively bogus, but the unit tests are matched
to the results. If you want real models, you have to download them from SourceForge.
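
For convenience, the three steps above could be wrapped in one small script. This is only a sketch: the names bootstrap_opennlp, run_in and DRY_RUN are made up for illustration and are not part of the patch, and the directories are copied as written in the steps above, assuming each is relative to the top-level checkout.

```shell
#!/bin/sh
# Sketch of the bootstrap sequence from this message.
# Helper names and the DRY_RUN switch are hypothetical, not part of the patch.

# run_in DIR CMD [ARGS...]: run CMD inside DIR.
# With DRY_RUN=1 the command is only printed, nothing is executed.
run_in() {
  dir=$1
  shift
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "[$dir] $*"
  else
    ( cd "$dir" && "$@" )
  fi
}

# bootstrap_opennlp TOP: TOP is the top-level checkout directory (default: .).
# Steps are chained with && so a failed step stops the sequence.
bootstrap_opennlp() {
  top=${1:-.}
  # 1. Compile; this also downloads the OpenNLP jars via ivy.
  run_in "$top" ant clean compile &&
  # 2. Train the small test models (path as given in the steps above).
  run_in "$top/solr/contrib/opennlp/test-files/training" sh bin/training.sh &&
  # 3. Build the example and run the contrib unit tests from the solr directory.
  run_in "$top/solr" ant example test-contrib
}
```

Sourcing the file and running `DRY_RUN=1 bootstrap_opennlp .` prints the three commands with their directories without executing anything; dropping DRY_RUN runs them for real.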
                
> Add OpenNLP Analysis capabilities as a module
> ---------------------------------------------
>
>                 Key: LUCENE-2899
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2899
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: modules/analysis
>            Reporter: Grant Ingersoll
>            Assignee: Grant Ingersoll
>            Priority: Minor
>         Attachments: LUCENE-2899.patch, LUCENE-2899.patch, LUCENE-2899.patch, opennlp_trunk.patch
>
>
> Now that OpenNLP is an ASF project and has a nice license, it would be nice to have a
> submodule (under analysis) that exposed capabilities for it. Drew Farris, Tom Morton and I
> have code that does:
> * Sentence Detection as a Tokenizer (could also be a TokenFilter, although it would have
>   to change slightly to buffer tokens)
> * NamedEntity recognition as a TokenFilter
> We are also planning a Tokenizer/TokenFilter that can put parts of speech as either payloads
> (PartOfSpeechAttribute?) on a token or at the same position.
> I'd propose it go under:
> modules/analysis/opennlp

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

