stanbol-dev mailing list archives

From Rupert Westenthaler <rupert.westentha...@gmail.com>
Subject Re: Opennlp NER ...
Date Sat, 03 Nov 2012 10:59:01 GMT
Hi

The implementation of the CustomNERModelEnhancementEngine
(STANBOL-792) is now available. The documentation can be found at [1].

I also updated the eHealth demo ("{stanbol-trunk}/demo/ehealth") to
use the new Engine with 5 custom NER models for DNA, RNA, Proteins,
Cell Type and Cell Line based on the BioNLP2004 dataset [2]. When you
build (mvn clean install) and install the eHealth demo bundle
(org.apache.stanbol.demo.ehealth-0.10.1-SNAPSHOT.jar) to the Stanbol
Launcher (revision > 1405306), you can test the engine with the
chain http://localhost:8080/enhancer/chain/ehealth-ner
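
For those who prefer to call the chain from code, here is a minimal
sketch in plain Java (11+) that POSTs some text to the chain URL above.
The class name, the sample sentence and the "text/turtle" Accept header
are assumptions made for illustration; adjust them to your setup.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of Stanbol.
class EhealthNerChainClient {

    // Chain URL from this mail; assumes a Stanbol launcher running locally on port 8080.
    static final String CHAIN_URL = "http://localhost:8080/enhancer/chain/ehealth-ner";

    /** Builds a POST request that sends plain text to the given enhancement chain. */
    static HttpRequest buildRequest(String chainUrl, String text) {
        return HttpRequest.newBuilder(URI.create(chainUrl))
                .header("Content-Type", "text/plain; charset=UTF-8")
                // Assumption: the enhancement results are requested as Turtle.
                .header("Accept", "text/turtle")
                .POST(HttpRequest.BodyPublishers.ofString(text, StandardCharsets.UTF_8))
                .build();
    }

    public static void main(String[] args) throws Exception {
        // Illustrative sentence only; any biomedical text will do.
        String text = "IL-2 gene expression requires NF-kappa B activation in T cells.";
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(buildRequest(CHAIN_URL, text), HttpResponse.BodyHandlers.ofString());
        System.out.println("HTTP " + response.statusCode());
        System.out.println(response.body());
    }
}
```

Running main requires the launcher to be up with the eHealth demo bundle
installed; buildRequest on its own does not open a connection.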

@Andrea: I was not able to test the engine with NER models that
extract multiple entity types, as I was not able to find/build such a
model for testing. So if you find any issues regarding that please
report them.

I don't think I will have time to work on STANBOL-793 in the coming days
as ApacheCon is around the corner.

best
Rupert

[1] http://stanbol.apache.org/docs/trunk/components/enhancer/engines/customnermodelengine.html
[2] http://www.nactem.ac.uk/tsujii/GENIA/ERtask/report.html

On Wed, Oct 31, 2012 at 5:22 PM, Rupert Westenthaler
<rupert.westenthaler@gmail.com> wrote:
> Hi
>
> just to let you know that I can confirm that the type of the Named
> Entity is indeed provided by the Span#getType() method. So models for
> multiple Named Entity types are also supported by the Java API.
>
> best
> Rupert
>
> On Wed, Oct 31, 2012 at 3:45 PM, Rupert Westenthaler
> <rupert.westenthaler@gmail.com> wrote:
>> On Wed, Oct 31, 2012 at 3:31 PM, Andrea Taurchini <ataurchini@gmail.com> wrote:
>>> Dear Rupert,
>>> thanks again.
>>> Uhmmm ... using tokennamefinder from command line of opennlp if you use a
>>> multitype trained model then you get a multitype tagged output ... as for
>>> api .find method I suppose is the way you told me (one type per model ??).
>>>
>>
>> Maybe the Span#getType() returns the type of the found entity. I will
>> try this out. If this really provides the different types, then the
>> configuration will be like
>>
>>     {model-file-name};language={language};{type}={type-uri};{type2}={type-uri2};...
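
A rough sketch of how a line in this proposed syntax could be split into
its parts, in plain Java. The class name NerModelConfig, the model file
name and the type URIs are made up for illustration; this is not the
code of the actual engine.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical holder for one parsed configuration line.
class NerModelConfig {
    final String modelFile;
    final String language;                  // null if no language parameter was given
    final Map<String, String> typeMappings; // type used by the model -> type URI

    NerModelConfig(String modelFile, String language, Map<String, String> typeMappings) {
        this.modelFile = modelFile;
        this.language = language;
        this.typeMappings = typeMappings;
    }

    /** Parses "{model-file-name};language={language};{type}={type-uri};..." */
    static NerModelConfig parse(String line) {
        String[] parts = line.split(";");
        String modelFile = parts[0].trim();
        if (modelFile.isEmpty()) {
            throw new IllegalArgumentException("missing model file name in '" + line + "'");
        }
        String language = null;
        Map<String, String> types = new LinkedHashMap<>();
        for (int i = 1; i < parts.length; i++) {
            String part = parts[i].trim();
            if (part.isEmpty()) {
                continue; // tolerate empty segments such as ";;"
            }
            int eq = part.indexOf('=');
            if (eq <= 0) {
                throw new IllegalArgumentException("expected key=value but got '" + part + "'");
            }
            String key = part.substring(0, eq).trim();
            String value = part.substring(eq + 1).trim();
            if (key.equals("language")) {
                language = value;
            } else {
                types.put(key, value); // every other parameter is a type mapping
            }
        }
        return new NerModelConfig(modelFile, language, types);
    }

    public static void main(String[] args) {
        // File name and URIs are placeholders.
        NerModelConfig c = parse("my-ner-model.bin;language=en;"
                + "DNA=http://example.org/types/DNA;RNA=http://example.org/types/RNA");
        System.out.println(c.modelFile + " [" + c.language + "] " + c.typeMappings);
    }
}
```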
>>
>> BTW, I already created
>> https://issues.apache.org/jira/browse/STANBOL-792 for this feature.
>>
>>> Forgive me if I'm silly but I can't see how can I add configuration
>>> property under configuration tab of Felix WC.
>>>
>>
>> The form you see in the configuration is generated from an XML file in
>> the Bundle and this XML file is generated by the @Property annotations
>> in the implementation of the Engine. So as soon as these new
>> configuration options are implemented you will see the corresponding
>> options in the form.
>>
>>
>>> Thanks and best regards,
>>> Andrea
>>>
>>>
>>>
>>>
>>>
>>> 2012/10/31 Rupert Westenthaler <rupert.westenthaler@gmail.com>
>>>
>>>> Hi
>>>>
>>>> On Wed, Oct 31, 2012 at 2:25 PM, Andrea Taurchini <ataurchini@gmail.com>
>>>> wrote:
>>>> > Dear Rupert,
>>>> > as always thanks for your support.
>>>> > Is it possible to use a single model file to detect multiple dc-type ... or
>>>> > should I add more than one configuration property each with the same model
>>>> > file but different dc-type ... or else should I produce different model
>>>> > file.
>>>>
>>>> If this is possible with OpenNLP, then for sure, but AFAIK the
>>>> "opennlp.tools.namefind.NameFinderME#find(..)" method only provides the
>>>> token spans and probability. So it tells you only that you have found
>>>> a Named Entity from tokenA to tokenB and not the type of the Named
>>>> Entity.
>>>>
>>>> While I can imagine that one can train a model that detects different
>>>> types of entities, you will not know the specific type of a found
>>>> named entity. So found Entities may have any of the trained types.
>>>>
>>>> So if you want to distinguish between NamedEntities of the different
>>>> types you will need to train separate models.
>>>>
>>>> Please correct me if I am wrong.
>>>>
>>>> > However ... where do I have to set this configuration property (^_^)?
>>>> > Through OSGI admin?
>>>>
>>>> Using the configuration tab of the Felix Web Console is only one
>>>> option. There are also other possibilities to provide configurations.
>>>> You can also provide configuration files to the Sling FileInstaller as
>>>> described at [1] and soon also under the new "Production" section on
>>>> the Stanbol webpage (currently only available on the staging server
>>>> [2])
>>>>
>>>>
>>>>
>>>> [1] http://markmail.org/message/jpxpl6x4nkmz6kda
>>>> [2] http://stanbol.staging.apache.org/production/partial-updates.html
>>>>
>>>> >
>>>> > Thanks a lot.
>>>> >
>>>> > Kindest regards,
>>>> > Andrea
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>> > 2012/10/31 Rupert Westenthaler <rupert.westenthaler@gmail.com>
>>>> >
>>>> >> Hi Andrea,
>>>> >>
>>>> >> On Tue, Oct 30, 2012 at 4:15 PM, Andrea Taurchini <ataurchini@gmail.com> wrote:
>>>> >> > Dear All,
>>>> >> > I developed my own models for NER based on OPENNLP.
>>>> >> > Within these models I have more entities than person, organization and
>>>> >> > places ... will stanbol enhance text using this added entities?
>>>> >> >
>>>> >>
>>>> >> Currently both the OpenNLP NER engine as well as the
>>>> >> NamedEntityLinkingEngine can only handle Persons, Organizations and
>>>> >> Places. In their current form you will not be able to use them to link
>>>> >> other types.
>>>> >>
>>>> >> For both engines this is mainly because of the configuration. So
>>>> >> extending those engines to support other (or better, arbitrary
>>>> >> configurable) types would require extending the engines' configuration
>>>> >> options. In the following I will try to describe the necessary
>>>> >> extensions.
>>>> >>
>>>> >> ## OpenNLP NER engine
>>>> >>
>>>> >> The NER engine needs the mappings for an {ner-model} to its {language}
>>>> >> and the extracted {entity-type}. Currently this works by a constant
>>>> >> defining the mappings for persons, organizations and places. NLP
>>>> >> models are loaded by using the OpenNLP service (defined by the
>>>> >> o.a.stanbol.commons.opennlp module).
>>>> >>
>>>> >> To configure additional models and types I would suggest adding an
>>>> >> additional configuration property that uses the following syntax
>>>> >>
>>>> >>     {model-file-name};lang={language};type={entity-type}
>>>> >>
>>>> >> The OpenNLP TokenNameFinderModel would be loaded from the configured
>>>> >> "{model-file-name}" via the Stanbol DataFileProvider service.
>>>> >> Practically this means that users would need to copy their custom
>>>> >> models to the "{stanbol.home}/datafiles" directory.
>>>> >>
>>>> >> The language parameter "lang={language}" would specify the language
>>>> >> supported by this model. The "type={entity-type}" parameter would
>>>> >> specify the dc-type value set for fise:TextAnnotations created for
>>>> >> named entities extracted by the model.
>>>> >>
>>>> >>
>>>> >> ## NamedEntityLinkingEngine
>>>> >>
>>>> >> For this engine the main problem with the current implementation is
>>>> >> that the current way to configure mappings does not allow configuring
>>>> >> arbitrary mappings. Because of that one would need to implement a
>>>> >> different approach to configure the mappings for linked
>>>> >> fise:TextAnnotations dc:type values.
>>>> >>
>>>> >> I would suggest using a configuration similar to the "type mapping"
>>>> >> [1] as already used by the KeywordLinkingEngine. The syntax would be
>>>> >> like
>>>> >>
>>>> >>      {dc-type} > {vocabulary-type}; {vocabulary-type}; ...
>>>> >>      {dc-type} > *
>>>> >>      {dc-type}
>>>> >>
>>>> >> where the {dc-type} would be the value of the dc-type property of the
>>>> >> TextAnnotation and {vocabulary-type} is the rdf:type value required
>>>> >> for linked Entities in the vocabulary linked against. * represents the
>>>> >> wild-card (any type) and {dc-type} is a shorthand for
>>>> >> {dc-type} > {dc-type}
>>>> >>
>>>> >> The current default mappings would be represented in this syntax by
>>>> >>
>>>> >>     dbp-ont:Place
>>>> >>     dbp-ont:Person
>>>> >>     dbp-ont:Organisation
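
To make the three forms of this mapping syntax concrete, here is a small
sketch of a parser in plain Java. The class TypeMapping and its method
names are hypothetical and only follow the syntax as described above;
the real engine may handle this differently.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical representation of one mapping line.
class TypeMapping {
    final String dcType;
    final boolean anyType;             // true for "{dc-type} > *"
    final Set<String> vocabularyTypes; // empty when anyType is true

    TypeMapping(String dcType, boolean anyType, Set<String> vocabularyTypes) {
        this.dcType = dcType;
        this.anyType = anyType;
        this.vocabularyTypes = vocabularyTypes;
    }

    /** Parses "{dc-type} > {type}; {type}; ...", "{dc-type} > *" or "{dc-type}". */
    static TypeMapping parse(String line) {
        int sep = line.indexOf('>');
        String dcType = (sep < 0 ? line : line.substring(0, sep)).trim();
        if (dcType.isEmpty()) {
            throw new IllegalArgumentException("missing dc-type in '" + line + "'");
        }
        Set<String> targets = new LinkedHashSet<>();
        if (sep < 0) {
            // "{dc-type}" alone is shorthand for "{dc-type} > {dc-type}"
            targets.add(dcType);
            return new TypeMapping(dcType, false, targets);
        }
        boolean any = false;
        for (String raw : line.substring(sep + 1).split(";")) {
            String target = raw.trim();
            if (target.equals("*")) {
                any = true;
            } else if (!target.isEmpty()) {
                targets.add(target);
            }
        }
        if (any) {
            targets.clear(); // the wild-card makes explicit types redundant
        } else if (targets.isEmpty()) {
            throw new IllegalArgumentException("no vocabulary type in '" + line + "'");
        }
        return new TypeMapping(dcType, any, targets);
    }

    /** True if an Entity with the given rdf:type may be linked for this dc-type. */
    boolean accepts(String rdfType) {
        return anyType || vocabularyTypes.contains(rdfType);
    }

    public static void main(String[] args) {
        String[] defaults = {"dbp-ont:Place", "dbp-ont:Person", "dbp-ont:Organisation"};
        for (String line : defaults) {
            TypeMapping m = parse(line);
            System.out.println(m.dcType + " > " + m.vocabularyTypes);
        }
    }
}
```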
>>>> >>
>>>> >> I would suggest keeping support for the current properties so as not
>>>> >> to break backward compatibility.
>>>> >>
>>>> >> If this extension is sufficient I suggest creating the corresponding
>>>> >> JIRA issues.
>>>> >>
>>>> >> best
>>>> >> Rupert
>>>> >>
>>>> >> [1] http://stanbol.apache.org/docs/trunk/components/enhancer/engines/keywordlinkingengine.html#type-mappings-syntax
>>>> >>
>>>> >> > Thanks and best regards,
>>>> >> > Andrea
>>>> >>
>>>> >>
>>>> >>
>>>> >> --
>>>> >> | Rupert Westenthaler             rupert.westenthaler@gmail.com
>>>> >> | Bodenlehenstraße 11                             ++43-699-11108907
>>>> >> | A-5500 Bischofshofen