lucene-dev mailing list archives

From Eric Pugh <ep...@opensourceconnections.com>
Subject Solr UIMA Notes
Date Fri, 10 Aug 2012 20:40:50 GMT
Hi all,

I've been working through the SolrUIMA demo and would like to propose some changes, based on
that experience, to make the UIMA stuff more accessible to a new user. Since JIRA is down, I
thought I would email my notes to the list and see if anyone can answer my questions.

Eric


1) The class org.apache.lucene.analysis.uima.ae.OverridingParamsAEProvider specifically mentions
that it takes params supplied by Solr's solrconfig.xml and feeds them into the AnalysisEngine.
It has no Solr imports, so it could be used with anything, and it seems odd for the description
of a Lucene class to refer to Solr. Changing the phrasing from "injecting runtime parameters
defined in the solrconfig.xml Solr configuration file" to "injecting runtime parameters such
as those defined in the Solr solrconfig.xml configuration file" might make the intent clearer
and explain why the class isn't in a Solr package, even though we have a Solr contrib module
for UIMA.

2) The tests org.apache.solr.uima.analysis.UIMAAnnotationsTokenizerFactoryTest and UIMATypeAwareAnnotationsTokenizerFactoryTest
exercise code that lives in the o.a.lucene structure, but with all the overhead of using Solr.
There is no corresponding test in the o.a.lucene path for those factory classes.

3) When going through the http://wiki.apache.org/solr/SolrUIMA/ tutorial, it's very odd that
you flip back and forth between the wiki page and content stored in SVN as you follow the
directions, especially since the bits of sample config in SVN aren't used by tests or anything
else. I'd like to move them to the wiki only, so they are easier to edit and keep up to date.

4) The test files refer to annotation engines with names like "org.apache.solr.uima.ts.SentimentAnnotation".
However, no classes with those names exist in the main source tree! When you go down the rabbit
hole, you eventually end up at a Java class called org.apache.solr.uima.processor.an.DummySentimentAnnotator,
which is the actual annotator. I'd like to change the test code so that we at least use a
name like "org.apache.solr.uima.ts.DummySentimentAnnotation" or even
"org.apache.solr.uima.processor.an.DummySentimentAnnotator". I got very excited that the
out-of-the-box demo had sentiment analysis, but it really doesn't; it is just some mock code.

5) It appears that when you pass a multivalued field through to UIMA, only the first value
is actually processed. If my XML (solr.xml from the example docs) looks like:

  <field name="features">Advanced Full-Text Search Capabilities using Lucene</field>
  <field name="features">Optimized for High Volume Web Traffic</field>

Then only the text "Advanced Full-Text Search Capabilities using Lucene" gets processed!
I have a separate patch I will submit that calls getFieldValues() instead of getFieldValue()
on the SolrInputDocument.

6) You need to bump your memory allocation! Use -Xmx1024m -Xms512m, or the tests WILL run
out of heap space.

7) I'd like to move the UIMA XML files etc. into the /conf directory, instead of accessing
the files that are inside the JAR file. That makes them much easier to hack on. I copied
solr/contrib/uima/src/resources/*.xml into solr/example/solr/collection1/conf/uima, and
access them via:
        <!--str name="analysisEngine">/org/apache/uima/desc/OverridingParamsExtServicesAE.xml</str-->

        <str name="analysisEngine">solr/${solr.core.instanceDir}/conf/uima/OverridingParamsExtServicesAE.xml</str>

8) It appears that for each annotation, I can only use the last "feature" defined. This
doesn't work:
          <lst name="type">
            <str name="name">org.apache.uima.alchemy.ts.language.LanguageFS</str>
            <lst name="mapping">
              <str name="feature">language</str>
              <str name="field">language</str>
            </lst>
          </lst>
          <lst name="type">
            <str name="name">org.apache.uima.alchemy.ts.language.LanguageFS</str>
            <lst name="mapping">
              <str name="feature">wikipedia</str>
              <str name="field">language_wikipedia</str>
            </lst>
          </lst>


Okay, I finally figured it out: the mappings have to look like this inside a single type definition:
            <lst name="mapping">
              <str name="feature">wikipedia</str>
              <str name="field">language_wikipedia</str>
            </lst>
            <lst name="mapping">
              <str name="feature">language</str>
              <str name="field">language</str>
            </lst>
            <lst name="mapping">
              <str name="feature">ethnologue</str>
              <str name="fieldNameFeature">language</str>
              <str name="dynamicField">*_sm</str>
            </lst>
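
Putting the two snippets together, a complete type definition might look like the sketch below.
It assumes the three mappings belong to the same org.apache.uima.alchemy.ts.language.LanguageFS
type used in the failing attempt; the point is a single "type" entry with several "mapping"
children, rather than one "type" entry per feature.

```xml
<!-- Sketch: one type entry that holds all of its feature mappings.
     Assumes the LanguageFS type from the failing attempt above. -->
<lst name="type">
  <str name="name">org.apache.uima.alchemy.ts.language.LanguageFS</str>
  <lst name="mapping">
    <str name="feature">wikipedia</str>
    <str name="field">language_wikipedia</str>
  </lst>
  <lst name="mapping">
    <str name="feature">language</str>
    <str name="field">language</str>
  </lst>
  <lst name="mapping">
    <str name="feature">ethnologue</str>
    <str name="fieldNameFeature">language</str>
    <str name="dynamicField">*_sm</str>
  </lst>
</lst>
```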
		


9) I'd like to patch the default solrconfig.xml to include the UIMA jars, move the config
files over to /conf/uima, and then just comment out the example. Do we think this is a good
thing? Since you need an AlchemyAPI key, we could have the example do only the sentence
parsing and comment out the AlchemyAPI keys in solrconfig.xml. Or should we just leave the
files in the source tree and document the steps?
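
For illustration, including the UIMA jars in solrconfig.xml could be done with lib directives
along these lines. This is only a sketch: both directory paths and the jar name pattern are
assumptions based on the collection1 layout mentioned in item 7, and would need adjusting to
the actual checkout.

```xml
<!-- Sketch only. Paths are assumed to be relative to the core's instanceDir
     (solr/example/solr/collection1); the jar name pattern is also an assumption. -->
<lib dir="../../../contrib/uima/lib" />
<lib dir="../../../dist/" regex="apache-solr-uima-\d.*\.jar" />
```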





-----------------------------------------------------
Eric Pugh | Principal | OpenSource Connections, LLC | 434.466.1467 | http://www.opensourceconnections.com
Co-Author: Apache Solr 3 Enterprise Search Server available from http://www.packtpub.com/apache-solr-3-enterprise-search-server/book
