incubator-stanbol-dev mailing list archives

From Rupert Westenthaler <rwes...@apache.org>
Subject Re: Add a custom thesaurus to RICK
Date Fri, 03 Dec 2010 18:16:59 GMT
Hi Florent

>> So if your Thesaurus contains Persons, Organizations and Locations,
>> then everything should work as expected. If your Thesaurus contains
>> other types of entities, then it would not work even with the engine I
>> am currently implementing.
>
> That is my case: my Thesaurus contains many other things besides Persons,
> Organisations or Locations.
>
> So, if I understand correctly:
> In order to use RICK for "linking things" according to the Thesaurus, I
> first have to implement an Enhancement Engine (something like
> eu.iksproject.fise.engines.opennlp.impl.NamedEntityExtractionEnhancementEngine)
> that detects string representations of all the things present in the Thesaurus.
>
> Once I have that, I could use your "in construction" RICK engine.
>
> Do you think that could work?

Yes, one would need an engine that preprocesses the parsed text and
extracts candidates for the Entity-Linking-Engine, because looking up
entities for every word in the parsed content would not be feasible in
real-world scenarios. (However, I would implement such an engine just
to give it a try ^^)
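To make the "candidate extraction" idea concrete, here is a minimal,
purely hypothetical sketch (not the actual Stanbol/FISE engine API):
it treats maximal runs of capitalized tokens as candidate mentions
that a downstream entity-linking engine could then look up. The class
and method names are my own invention for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Naive candidate extractor: a hypothetical sketch, NOT the real
// NamedEntityExtractionEnhancementEngine. Maximal runs of capitalized
// tokens are collected as candidate mentions for entity lookup.
public class CandidateExtractor {

    public static List<String> extractCandidates(String text) {
        List<String> candidates = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String token : text.split("\\s+")) {
            // strip punctuation so "Vienna." still counts as capitalized
            String cleaned = token.replaceAll("[^\\p{L}]", "");
            if (!cleaned.isEmpty() && Character.isUpperCase(cleaned.charAt(0))) {
                if (current.length() > 0) current.append(' ');
                current.append(cleaned);
            } else if (current.length() > 0) {
                candidates.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) candidates.add(current.toString());
        return candidates;
    }

    public static void main(String[] args) {
        System.out.println(extractCandidates(
            "The talk by Rupert Westenthaler covered Apache Stanbol in Vienna."));
        // prints [The, Rupert Westenthaler, Apache Stanbol, Vienna]
    }
}
```

Note how crude this is: it also picks up sentence-initial words like
"The", which is exactly the kind of problem a real NLP-based engine
would have to solve.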

For the current EntityMentionEnhancementEngine the decision was to
look up entities only for Named Entities that were detected by the
NamedEntityExtractionEnhancementEngine.
This decision was also driven by the fact that the
EntityMentionEnhancementEngine works on the dbPedia dataset.
dbPedia contains entities for many commonly used words. Therefore one
would end up with linked entities for most of the words present in
the parsed text - something not very useful and also very expensive to
compute.

With the new RICK-based Entity-Linking-Engine, one now has the
possibility to use other datasets as well. If one has a thesaurus with
entities whose labels are not commonly used words, it might be
completely feasible to process entity lookups for many more words, and
possibly also phrases, within the parsed document ...
 ... and that's exactly my problem in answering your question. I am not
an expert in NLP. I can only describe the problem, but I do not know
the right answers. Olivier has much more knowledge in that area, so
maybe he can contribute some thoughts as well.

But as mentioned above, I could implement an engine that processes all
words in a text (maybe ignoring stop words). Such an engine would not
work with big datasets like dbPedia, geonames ..., but when used
together with a relatively small Thesaurus (let's assume < 500,000
entities) it could work just fine.
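The "all words minus stop words" approach above can be sketched as
follows. This is only an illustration under my own assumptions: the
thesaurus is modelled as a plain in-memory Map from lowercase label to
entity URI, whereas a real setup would query a RICK referenced site or
similar index. Class, method, and URI names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: look up every non-stop-word token of the text
// in a small in-memory thesaurus. Feasible only because the thesaurus
// is small; with dbPedia-sized data this would be far too expensive.
public class ThesaurusLookup {

    static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("the", "a", "an", "of", "in", "and", "is"));

    public static List<String> link(String text, Map<String, String> thesaurus) {
        List<String> matches = new ArrayList<>();
        for (String token : text.toLowerCase().split("\\W+")) {
            if (token.isEmpty() || STOP_WORDS.contains(token)) continue;
            String entityUri = thesaurus.get(token);  // one lookup per word
            if (entityUri != null) matches.add(entityUri);
        }
        return matches;
    }

    public static void main(String[] args) {
        Map<String, String> thesaurus = new HashMap<>();
        thesaurus.put("hydrology", "urn:thesaurus:hydrology");
        thesaurus.put("aquifer", "urn:thesaurus:aquifer");
        System.out.println(link("The aquifer is a topic of hydrology", thesaurus));
        // prints [urn:thesaurus:aquifer, urn:thesaurus:hydrology]
    }
}
```

Multi-word labels would additionally require matching token phrases
rather than single tokens, which is where the NLP questions above come
back in.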

best
Rupert Westenthaler
