incubator-stanbol-dev mailing list archives

From Tommaso Teofili <tommaso.teof...@gmail.com>
Subject Re: Add a custom thesaurus to RICK
Date Mon, 06 Dec 2010 11:42:22 GMT
Hi Florent

2010/12/6 Florent André <florent@apache.org>

> Hi,
>
> Here is an attempt to synthesize the past discussions on this subject.
> Big thanks for your feedback.
>
> Some of the text here is copy-pasted from yours. Some of my text / ideas
> may be wrong or misunderstood, as I'm a newbie in some domains.
>
> So... let's go ! :)
>
> == Problem :
>
> The problem is :
> How to enhance text with a custom controlled vocabulary (like a
> thesaurus) held by Rick.
>
> == Solutions
>
> 3 options (third with 2 variants) :
>
> 1) UIMA :
> use a UIMA annotator (like the dictionary or the concept mapper one)
>
> 2) In memory :
> write a dedicated engine that loads the list of terms from Rick into an
> in-memory Java HashMap and scans the tokenized text (using either Lucene's
> or OpenNLP's tokenizers to deal with the punctuation correctly) to look
> up your terms.
>
> 3) Solr / Lucene :
> 3-a) Use the Rick Solr interface, with an indirection between Rick and
> the index
> 3-b) Use the Solr TermsComponent, which means that such requests can be
> executed directly on the inverted index. Lucene holds the inverted index
> in memory, which makes such queries really fast.
>
> == Questions / thoughts
>
> 1) UIMA
> - Here we can build our own dictionary (index). So the dictionary will
> contain all terms that we "want" (no more, no less).
>
> - One point will be to generate the dictionary with up-to-date data
> from a Rick repository.
>
> 2) In memory
> - Always keeping all terms in memory could be costly, even more so if we
> work with more than one thesaurus
> - As we start from "scratch", all routines (such as tracking positions in
> the text) and optimisations have to be written.
> - This approach seems close to the UIMA dictionary one, without the
> optimisations that could be done in UIMA...
>
> 3) Solr / Lucene
> - I don't know how the index is built in Lucene / Solr :
> -- can we control it "manually" ?
> -- does the index depend on Lucene's internal routines ?
> -- does the index contain all terms encountered ?
> -- how does it deal with compound (multi-word) terms ?
>

You can control each of these aspects by using the appropriate Analyzer
[1] implementation, possibly with TokenFilters [2] to filter out
unwanted tokens.
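
To illustrate what an Analyzer plus TokenFilters chain does, here is a
minimal plain-Java sketch. It deliberately does not use the Lucene API:
AnalysisChainSketch and its methods are hypothetical names made up for this
example, and the stop-word list is arbitrary. In Lucene, the tokenizing step
and each filtering step would instead be configured through the Analyzer
implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class AnalysisChainSketch {

    // Tokenizer role: split on any run of characters that are
    // neither letters nor digits, which also drops punctuation.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Filter role: normalize case so "Theory" and "theory" become one term.
    public static List<String> lowerCase(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            out.add(t.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    // Filter role: drop tokens that should never reach the index.
    public static List<String> removeStopWords(List<String> tokens, Set<String> stopWords) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            if (!stopWords.contains(t)) {
                out.add(t);
            }
        }
        return out;
    }

    // Analyzer role: fix the order of the tokenizer and the filters.
    public static List<String> analyze(String text, Set<String> stopWords) {
        return removeStopWords(lowerCase(tokenize(text)), stopWords);
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<>(Arrays.asList("the", "of", "a"));
        // prints [theory, relativity, summary]
        System.out.println(analyze("The Theory of Relativity, a summary.", stopWords));
    }
}
```

Which terms end up in the index, and whether a multi-word label stays
together or is split, depends entirely on which steps are put in this chain.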


>
> - Another point is that, in my mind, a search engine is built to
> answer this question :
> -- In *which* documents do I have *this* term ?
> - And not really :
> -- In *this* document, which of *these* terms do I have ?
> (the Solr TermsComponent seems to answer this second question though)
>
> == Result
>
> As far as I know, some UIMA integration is already done in FISE, but none
> with Solr / Lucene. (So, I have some examples to build on ;) ).
>

There is an AnalysisEngine in UIMA to store annotations in a Lucene index
(but it should be updated to the latest Lucene version) called Lucas [3]. I
proposed a similar component for Solr that is about to be committed in
UIMA [4]. Both deal with a UIMA -> Lucene/Solr pipeline.

There is also another patch I made for Solr [5] to enrich documents with
UIMA while they are being indexed.
Docs are sent to Solr for indexing, then (logically) go to UIMA just before
being indexed for enrichment and then enrichments are mapped back to Solr
fields (so this is Solr -> UIMA -> Solr).



>
> In-memory seems to be close to the UIMA dictionary approach, without the
> possible UIMA optimizations.
>
> So, as of today, I think that the UIMA option seems to be the best.
>
>
> Thanks for your inputs.
>
> ++
>
>
At a high level I think it makes sense to consider all three solution
scenarios, provided we have the proper design indirections, since I can see
different concerns addressed by each.
Thanks Florent for your recap and thoughts.
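
As a rough sketch of such an indirection, the Java below defines one
interface with the in-memory option (2) as its first implementation.
Everything in it is an assumption for illustration: TermMatcher,
InMemoryTermMatcher and Match are hypothetical names, not existing Rick or
FISE APIs, and a HashSet of normalized labels stands in for the HashMap
mentioned above. UIMA- or Solr-backed implementations could sit behind the
same interface.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class TermLookupSketch {

    /** A thesaurus term found in the text; token offsets, end exclusive. */
    public static final class Match {
        public final String term;
        public final int startToken;
        public final int endToken;

        public Match(String term, int startToken, int endToken) {
            this.term = term;
            this.startToken = startToken;
            this.endToken = endToken;
        }

        @Override
        public String toString() {
            return term + " [" + startToken + "," + endToken + ")";
        }
    }

    /** Hypothetical indirection shared by all lookup strategies. */
    public interface TermMatcher {
        List<Match> findTerms(String text);
    }

    /** Option 2: all terms held in memory, longest match wins. */
    public static final class InMemoryTermMatcher implements TermMatcher {
        private final Set<String> terms = new HashSet<>();
        private int maxWords = 1;

        public InMemoryTermMatcher(Collection<String> labels) {
            for (String label : labels) {
                List<String> words = tokenize(label);
                if (words.isEmpty()) {
                    continue;
                }
                // Store labels in the same normalized form as the text tokens.
                terms.add(String.join(" ", words));
                maxWords = Math.max(maxWords, words.size());
            }
        }

        // Lower-case and split on runs of non-letter, non-digit characters.
        static List<String> tokenize(String text) {
            List<String> tokens = new ArrayList<>();
            for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
                if (!t.isEmpty()) {
                    tokens.add(t);
                }
            }
            return tokens;
        }

        @Override
        public List<Match> findTerms(String text) {
            List<String> tokens = tokenize(text);
            List<Match> matches = new ArrayList<>();
            int i = 0;
            while (i < tokens.size()) {
                int matched = 0;
                // Try the longest window first so multi-word terms win.
                for (int len = Math.min(maxWords, tokens.size() - i); len >= 1; len--) {
                    String candidate = String.join(" ", tokens.subList(i, i + len));
                    if (terms.contains(candidate)) {
                        matches.add(new Match(candidate, i, i + len));
                        matched = len;
                        break;
                    }
                }
                // Skip past a match, or advance by one token if none was found.
                i += Math.max(matched, 1);
            }
            return matches;
        }
    }

    public static void main(String[] args) {
        TermMatcher matcher = new InMemoryTermMatcher(
                Arrays.asList("climate change", "climate", "sea level"));
        for (Match m : matcher.findTerms(
                "Climate change raises the sea level; climate models agree.")) {
            System.out.println(m);
        }
    }
}
```

Callers only see TermMatcher, so swapping the in-memory lookup for another
backend would not change the engine code that consumes the matches.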
Cheers,
Tommaso

[1] :
http://lucene.apache.org/java/3_0_0/api/core/org/apache/lucene/analysis/Analyzer.html
[2] :
http://lucene.apache.org/java/2_9_0/api/all/org/apache/lucene/analysis/TokenFilter.html
[3] : http://uima.apache.org/sandbox.html#lucas.consumer
[4] : http://markmail.org/thread/nmibmz5nd3oabu4v
[5] : https://issues.apache.org/jira/browse/SOLR-2129


>
> On 12/04/2010 01:40 PM, Tommaso Teofili wrote:
> > Hi Olivier
> >
> > 2010/12/4 Olivier Grisel <olivier.grisel@ensta.org>
> >
> >>
> >> If you want approximate match (like lucene fuzzy match) then using the
> >> rick solr interface might be ok but probably much slower. Maybe rick
> >> could offer a direct, in-JVM lucene API access to the solr indexes
> >> without the latency of the HTTP solr access.
> >>
> >
> > that makes sense, but note that if Solr is running on the same machine
> > or you can reach the index directory some way, then you can also use an
> > EmbeddedSolrServer [1] without going through HTTP.
> > However I think that having an indirection between Rick and the index
> > would be nice, so I think Solr would be good for more use cases. But if
> > needed we can always provide different implementations of the same
> > interface (including also Lucene).
> > My 2 cents,
> > Tommaso
> >
> > [1] : http://wiki.apache.org/solr/Solrj#EmbeddedSolrServer
> >
>
