lucene-dev mailing list archives

From Tommaso Teofili <tommaso.teof...@gmail.com>
Subject Re: Solr UIMA Notes
Date Mon, 27 Aug 2012 21:25:51 GMT
Hi Eric,

2012/8/10 Eric Pugh <epugh@opensourceconnections.com>

> Hi all,
>
> I've been working through the SolrUIMA demo, and have some changes to
> propose based on going through it to make the UIMA stuff more accessible to
> a new user.  Since JIRA is down, I thought I would email my notes to the
> list and see if anyone can clarify my questions.
>
> Eric
>
>
> 1) The class org.apache.lucene.analysis.uima.ae.OverridingParamsAEProvider
> specifically mentions that it is used to take params supplied by Solr's
> solrconfig.xml and feed them into the AnalysisEngine.  While no Solr
> imports exist, so it could be used with anything, it seems odd that the
> phrasing for a Lucene class refers to Solr.  Changing the phrasing from
> "injecting runtime parameters defined in the solrconfig.xml Solr
> configuration file" to "injecting runtime parameters such as those defined
> in the Solr solrconfig.xml configuration file" might make the intent
> clearer and explain why it isn't in a  Solr package, even though we have a
> Solr contrib module for UIMA.
>

Yep, that's because those o.a.lucene.analysis.uima.ae classes were originally
Solr "citizens". When we created the UIMA tokenizers we realized it would be
good to have the factory classes available to both Lucene and Solr, so they
were moved to lucene/analysis/uima. You're right, though: the javadoc should
be adjusted.


>
> 2) The tests
> org.apache.solr.uima.analysis.UIMAAnnotationsTokenizerFactoryTest and
> UIMATypeAwareAnnotationsTokenizerFactoryTest test code that is in the
> o.a.lucene structure, but with all the overhead of using Solr.  There is no
> corresponding test in the o.a.lucene path for those factory classes.
>

These two tests are explicitly for the Solr factories that are meant to be
declared in a Solr schema. The tests in the lucene/analysis/uima module
are UIMABaseAnalyzerTest (for the Analyzer built on UIMAAnnotationsTokenizer)
and UIMATypeAwareAnalyzerTest (for the TypeAware Analyzer).


>
> 3) When going through the http://wiki.apache.org/solr/SolrUIMA/ tutorial,
> it's very odd that you flip from the wiki page to content that is stored in
> SVN and back as you follow the directions.  Especially since the bits of
> sample config in SVN aren't used by tests or anything else.  I'd like to
> move them to just the wiki, so they are easier to edit and keep up to date.
>

+1


>
> 4) When looking at the test files we have annotation engines with names
> like "org.apache.solr.uima.ts.SentimentAnnotation".  However, they don't
> exist as classes in the main source tree!  And when you go down the rabbit
> hole, you eventually end up at a Java class called
> org.apache.solr.uima.processor.an.DummySentimentAnnotator that actually is
> the aforementioned annotator!  I'd like to change the test code so that we
> actually are at least using something called
>  "org.apache.solr.uima.ts.DummySentimentAnnotation" or even
> "org.apache.solr.uima.processor.an.DummySentimentAnnotator"!    I got very
> excited that out of the box demo had sentiment analysis, and it really
> didn't, just some mock code.
>

Maybe just renaming SentimentAnnotation to DummySentimentAnnotation would
make things more consistent and avoid confusion.


>
> 5) It appears that when you pass a multivalued field through to UIMA, only
> the first value is actually submitted for analysis.  If my XML (solr.xml
> from the example docs) looks like:
>
>   <field name="features">Advanced Full-Text Search Capabilities using
> Lucene</field>
>   <field name="features">Optimized for High Volume Web Traffic</field>
>
> Then what gets processed is only the text "Advanced Full-Text Search
> Capabilities using Lucene"!  I have a separate patch I will submit that
> uses getFieldValues() instead of getFieldValue() method on a
> SolrInputDocument.
>

This sounds like a bug. If you want to open a Jira issue and submit a patch
you're more than welcome; otherwise I can do that.
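
For reference, a minimal sketch of the difference between the two accessors.
The nested Doc class is a hypothetical stand-in that only mimics the
getFieldValue()/getFieldValues() behavior of SolrInputDocument so the example
runs without Solr on the classpath, and joining the values with a space is
an assumption, not necessarily what the patch does:

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MultiValuedFieldSketch {

    // Hypothetical stand-in for SolrInputDocument (not the real class).
    static class Doc {
        private final Map<String, List<Object>> fields = new LinkedHashMap<>();

        void addField(String name, Object value) {
            fields.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
        }

        // Returns only the first value of the field, or null if absent.
        Object getFieldValue(String name) {
            List<Object> values = fields.get(name);
            return (values == null || values.isEmpty()) ? null : values.get(0);
        }

        // Returns every value of the field, or null if absent.
        Collection<Object> getFieldValues(String name) {
            return fields.get(name);
        }
    }

    // Collects all values of a field into one text to hand to the analysis
    // engine. Joining with a single space is an assumption for illustration.
    static String textToAnalyze(Doc doc, String fieldName) {
        Collection<Object> values = doc.getFieldValues(fieldName);
        if (values == null) {
            return "";
        }
        StringBuilder sb = new StringBuilder();
        for (Object value : values) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(value);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Doc doc = new Doc();
        doc.addField("features", "Advanced Full-Text Search Capabilities using Lucene");
        doc.addField("features", "Optimized for High Volume Web Traffic");

        // Single-value accessor: the second value never reaches the analysis.
        System.out.println(doc.getFieldValue("features"));
        // Multi-value accessor: both values are included.
        System.out.println(textToAnalyze(doc, "features"));
    }
}
```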


>
> 6) You need to bump your memory allocation!  -Xmx1024m -Xms512m, or it
> WILL run out of heap space when running tests.
>

I was not aware of that; I'll try to reproduce it with a very small heap.


>
> 7) I'd like to move the UIMA xml files etc into the /conf directory,
> instead of accessing the files that are inside the JAR file.  Much easier
> to hack on.  I copied solr/contrib/uima/src/resources/*.xml into
> solr/example/solr/collection1/conf/uima, and access it via:
>         <!--str name="analysisEngine">/org/apache/uima/desc/OverridingParamsExtServicesAE.xml</str-->
>         <str name="analysisEngine">solr/${solr.core.instanceDir}/conf/uima/OverridingParamsExtServicesAE.xml</str>
>

OK, sounds good, although the file in question lives in
src/org/apache/uima/desc/resources, where it can already be edited easily
for "playing" with the tests.


>
> 8) It appears like for each annotation, I can only use the last "feature"
> defined.   This doesn't work:
>           <lst name="type">
>             <str name="name">org.apache.uima.alchemy.ts.language.LanguageFS</str>
>             <lst name="mapping">
>               <str name="feature">language</str>
>               <str name="field">language</str>
>             </lst>
>           </lst>
>           <lst name="type">
>             <str name="name">org.apache.uima.alchemy.ts.language.LanguageFS</str>
>             <lst name="mapping">
>               <str name="feature">wikipedia</str>
>               <str name="field">language_wikipedia</str>
>             </lst>
>           </lst>
>
>
> Okay, figured it out finally,  and it has to look like this inside a type
> definition:
>             <lst name="mapping">
>               <str name="feature">wikipedia</str>
>               <str name="field">language_wikipedia</str>
>             </lst>
>             <lst name="mapping">
>               <str name="feature">language</str>
>               <str name="field">language</str>
>             </lst>
>             <lst name="mapping">
>               <str name="feature">ethnologue</str>
>               <str name="fieldNameFeature">language</str>
>               <str name="dynamicField">*_sm</str>
>             </lst>
>
>
Sure, the latter is how it's supposed to work: features belong to one single
type, so all the mappings for a type go inside that type's definition.
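
One way to picture why repeating the type block keeps only the last feature
is the sketch below. It assumes the configuration ends up in a map keyed by
type name, so a second declaration of the same type replaces the first. That
is an assumption consistent with the behavior described above, not a
confirmed description of the Solr UIMA config reader; TypeMappingSketch and
its methods are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class TypeMappingSketch {

    static final String TYPE = "org.apache.uima.alchemy.ts.language.LanguageFS";

    // Hypothetical model of the parsed config: type name -> (feature -> field).
    // Declaring the same type name twice replaces the earlier entry.
    static void declareType(Map<String, Map<String, String>> config,
                            String typeName,
                            Map<String, String> featureToField) {
        config.put(typeName, featureToField);
    }

    // Two separate type blocks for the same type, one mapping each.
    static Map<String, Map<String, String>> twoTypeBlocks() {
        Map<String, Map<String, String>> config = new LinkedHashMap<>();

        Map<String, String> first = new LinkedHashMap<>();
        first.put("language", "language");
        declareType(config, TYPE, first);

        Map<String, String> second = new LinkedHashMap<>();
        second.put("wikipedia", "language_wikipedia");
        declareType(config, TYPE, second); // overwrites the "language" mapping

        return config;
    }

    // One type block holding both mappings.
    static Map<String, Map<String, String>> oneTypeBlock() {
        Map<String, Map<String, String>> config = new LinkedHashMap<>();

        Map<String, String> mappings = new LinkedHashMap<>();
        mappings.put("wikipedia", "language_wikipedia");
        mappings.put("language", "language");
        declareType(config, TYPE, mappings);

        return config;
    }

    public static void main(String[] args) {
        System.out.println("two type blocks: " + twoTypeBlocks().get(TYPE));
        System.out.println("one type block:  " + oneTypeBlock().get(TYPE));
    }
}
```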


>
>
> 9) I'd like to patch the default solrconfig.xml to include the UIMA jars,
> and move the config files over to /conf/uima, and then just comment out the
> example.  Do we think that this is a good thing? Since you have to have an
> AlchemyAPI key, we could just have the code do the sentence parsing as the
> example, and comment out the alchemyAPI keys in solrconfig.xml.  Or, just
> leave them in the source tree, and document the steps?
>

I assume that just adding the elements for importing the libs would be OK,
but we should avoid adding the AlchemyAPI AE by default because of the key
setting.
I think the best option is to open separate Jira tickets for the above tasks
and discuss them in more depth there.
Thanks for your effort, Eric.

Regards,
Tommaso


>
> -----------------------------------------------------
> Eric Pugh | Principal | OpenSource Connections, LLC | 434.466.1467 |
> http://www.opensourceconnections.com
> Co-Author: Apache Solr 3 Enterprise Search Server available from
> http://www.packtpub.com/apache-solr-3-enterprise-search-server/book
