lucene-solr-commits mailing list archives

From Apache Wiki <>
Subject [Solr Wiki] Update of "OpenNLP" by LanceXNorskog
Date Thu, 05 Jul 2012 00:59:27 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The "OpenNLP" page has been changed by LanceXNorskog:

  This example should work well with most English-language free text.
  == Installation ==
- See the patch for more information. The short story is that you have to download statistical
models from SourceForge to make OpenNLP work; the models do not have an Apache-compatible license.
+ For English language testing:
+ Until SOLR-2899 is committed:
+ * pull the latest trunk or 4.0 branch
+ * apply the patch 
+ * do 'ant compile'
+ * cd solr/contrib/opennlp/src/test-files/training 
+ * run 'bin/'
+ ** this will create binary files which will be included in the distribution when committed.
+ Now, go to trunk-dir/solr and run 'ant test-contrib'. This compiles the OpenNLP Lucene and
Solr code against the OpenNLP libraries and uses the small model files.
+ Deployment to Solr
+ A Solr core requires schema types for the OpenNLP tokenizer and filter, and also requires
model files. The distribution includes a schema.xml file in solr/contrib/opennlp/src/test-files/opennlp/solr/conf/
which demonstrates OpenNLP-based analyzers. It does not contain other text types (to avoid
falling out of date with the full text suite). Copy the text types from this file
into your test collection's schema.xml, and download "real" models for testing. You may also
have to add the OpenNLP lib directory to your solr/lib or solr/cores/collection/lib directory.
+ Now, download these model files to solr/contrib/opennlp/src/test-files/opennlp/solr/conf/opennlp/
+ * []
+ * The English-language models start with 'en'. The 'maxent' models are preferred to the
'perceptron' models.
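
The selection rule above (English models start with 'en', and 'maxent' is preferred to 'perceptron') can be sketched in a few lines of Python. This is only an illustration: the function name and the file names in the sample listing are assumptions made for the example, not names taken from the patch or from the download page.

```python
from typing import Dict, List


def pick_english_models(filenames: List[str]) -> List[str]:
    """Keep models whose names start with 'en'. When a maxent and a
    perceptron variant exist for the same task, keep the maxent one."""
    english = [name for name in filenames if name.startswith("en")]
    chosen: Dict[str, str] = {}
    for name in sorted(english):
        # Drop the variant marker so both variants share one task key,
        # e.g. "en-pos-maxent.bin" and "en-pos-perceptron.bin" -> "en-pos.bin".
        key = name.replace("-maxent", "").replace("-perceptron", "")
        current = chosen.get(key)
        if current is None or ("perceptron" in current and "maxent" in name):
            chosen[key] = name
    return sorted(chosen.values())


# Hypothetical directory listing, for illustration only.
listing = [
    "en-sent.bin",
    "en-token.bin",
    "en-pos-maxent.bin",
    "en-pos-perceptron.bin",
    "de-sent.bin",
]
print(pick_english_models(listing))
```

With the sample listing, the non-English file and the perceptron variant are dropped; a perceptron model is still kept when no maxent counterpart is present.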
+ Solr should now start without any exceptions. At this point, go to the Schema analyzer,
pick the 'text_opennlp_pos' field type, and post a sentence or two to the analyzer. You should
get text tokenized with payloads. Unfortunately, the analysis page shows the payloads as bytes instead
of text. If you would like to see them in text form, go vote on SOLR-3493.
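
Until then, a payload can be turned back into text by hand. The following is a minimal sketch under two assumptions that the page above does not confirm: that the payload holds a UTF-8 encoded tag, and that the analysis page prints it as space-separated hex bytes in brackets. The helper name is hypothetical.

```python
def payload_to_text(displayed: str) -> str:
    """Decode a payload shown as hex bytes (assumed format: "[4e 4e 50]")
    into text, assuming the bytes are UTF-8."""
    # Strip the assumed bracket-and-space formatting before decoding.
    cleaned = displayed.replace("[", "").replace("]", "").replace(" ", "")
    return bytes.fromhex(cleaned).decode("utf-8")


print(payload_to_text("[4e 4e 50]"))  # prints "NNP"
```

If the page uses a different byte notation, only the cleanup step needs to change.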
+ Licensing
+ The OpenNLP library is Apache-licensed. The 'jwnl' library is 'BSD-like'.
+ Model licensing:
+ * The contrib directory includes some small training data and scripts to generate model
files. These are supplied only for running "unit" tests against the complete Solr/Lucene/OpenNLP
code assemblies. They are not useful for exploring OpenNLP's features or for production
deployment. In solr/contrib/opennlp/src/test-files/training, run 'bin/' to populate
solr/contrib/opennlp/src/test-files/opennlp/solr/conf/opennlp with the test models. The schema.xml
in that conf/ directory uses those models.
+ * The models available from SourceForge are created from licensed training data. I have
not seen a formal description of their license status, but they are not "safe" for Apache.
If you want production-quality models for commercial use, you will need to make other arrangements.
