opennlp-dev mailing list archives

From Mark G <>
Subject request for Input or ideas.... EntityLinker tickets
Date Sat, 05 Oct 2013 21:58:51 GMT
Before I plug some tickets into Jira, I wanted to get some feedback from
the team on some changes I would like to make to the EntityLinker.
Below are what I consider improvement tickets.

1. Only the first start and end are populated in the CountryContext object
returned from CountryContext.find. It should return all instances of each
country mention in a map, so that the proximity of other toponyms to the
found country indicators can be included as a factor in the scoring.

Currently the user only gets the first indexOf for each country mention.
The country mentions are an attempt to better resolve ambiguous names (Paris,
Texas rather than Paris, France). Because only the first mention is
available, I cannot do a thorough proximity analysis to assist in scoring.
Basically I need every mention of every country indicator in the doc, which
I will correlate with every Named Entity span to produce a score. I am also
not yet passing the list of country codes into the database query as a where
predicate, which would improve performance tremendously (I will index the
column).
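A minimal sketch of how such a proximity factor could be computed once every mention offset is available. The class name, method names, and the 1 / (1 + distance) scoring function are assumptions for illustration only; they are not part of the existing EntityLinker code.

```java
import java.util.List;
import java.util.Map;

public class ProximitySketch {

    /** Smallest character distance between an entity offset and any mention offset. */
    public static int nearestDistance(int entityStart, List<Integer> mentionStarts) {
        int best = Integer.MAX_VALUE;
        for (int m : mentionStarts) {
            best = Math.min(best, Math.abs(entityStart - m));
        }
        return best;
    }

    /**
     * Hypothetical scoring function: maps the nearest distance into (0, 1],
     * so closer country mentions score higher. Returns 0 when there are no mentions.
     */
    public static double proximityScore(int entityStart, List<Integer> mentionStarts) {
        if (mentionStarts == null || mentionStarts.isEmpty()) {
            return 0.0;
        }
        return 1.0 / (1.0 + nearestDistance(entityStart, mentionStarts));
    }

    public static void main(String[] args) {
        // Assumed input: country code -> start offsets of every mention in the doc.
        Map<String, List<Integer>> mentions =
                Map.of("US", List.of(10, 200), "FR", List.of(500));
        int entityStart = 190; // start offset of a Named Entity span, e.g. "Paris"
        for (Map.Entry<String, List<Integer>> e : mentions.entrySet()) {
            System.out.println(e.getKey() + " -> " + proximityScore(entityStart, e.getValue()));
        }
    }
}
```

With all offsets in hand, the country whose mentions lie closest to an ambiguous toponym can be given more weight than one mentioned only far away in the document.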

2. Discovery of indicators for "country context" should be regex based, in
order to make context discovery more robust.

Currently I use String.indexOf(term) to discover the country hit list.
Regex would allow users to configure interesting ways to indicate
countries. Regex will also provide the array of start/end offsets I need for
issue 1, via repeated calls to Matcher.find.
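As a rough illustration of the regex approach, the sketch below collects every start/end pair per country code using java.util.regex. The class name, the map layout, and the example pattern are assumptions; the real CountryContext API may look different.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CountryMentionSketch {

    /**
     * Returns, for each country code, every {start, end} offset pair at which
     * its indicator pattern matches in the document text.
     */
    public static Map<String, List<int[]>> findAll(String doc, Map<String, Pattern> indicators) {
        Map<String, List<int[]>> hits = new LinkedHashMap<>();
        for (Map.Entry<String, Pattern> e : indicators.entrySet()) {
            Matcher m = e.getValue().matcher(doc);
            // Matcher.find advances through the text, yielding one match per call.
            while (m.find()) {
                hits.computeIfAbsent(e.getKey(), k -> new ArrayList<>())
                    .add(new int[] {m.start(), m.end()});
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Hypothetical user-configured indicator for one country.
        Map<String, Pattern> indicators = new LinkedHashMap<>();
        indicators.put("US",
                Pattern.compile("\\b(United States|U\\.S\\.)", Pattern.CASE_INSENSITIVE));

        String doc = "Paris, Texas is in the United States. The U.S. is large.";
        for (Map.Entry<String, List<int[]>> e : findAll(doc, indicators).entrySet()) {
            for (int[] span : e.getValue()) {
                System.out.println(e.getKey() + " [" + span[0] + ", " + span[1] + ")");
            }
        }
    }
}
```

Unlike a single indexOf call, the loop over Matcher.find yields every mention, and one pattern can cover several spellings of the same country.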

3. Fuzzy string matching should be part of the scoring. This would allow
MySQL fuzzy search to return more candidate toponyms.

Currently, the search into the MySQL gazetteers uses "boolean mode" and
each NER result is passed in as a literal string. If I implement a score
based on fuzzy string matching (do we have one?), the user could turn on
"natural language" mode in MySQL. We could then generate a score and apply a
threshold to allow for more recall on transliterated names etc.
I would also like to use proximity to the majority of points in the
document as a disambiguation criterion.
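One possible shape for such a score, sketched with a plain Levenshtein edit distance normalized by the longer string's length. This is only an assumption about what the fuzzy score could be; the class and method names are hypothetical, and any other string similarity measure could be swapped in.

```java
import java.util.Locale;

public class FuzzyScoreSketch {

    /** Classic dynamic-programming Levenshtein edit distance, using two rows. */
    public static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Case-insensitive similarity in [0, 1]; 1.0 means identical strings. */
    public static double similarity(String a, String b) {
        String x = a.toLowerCase(Locale.ROOT);
        String y = b.toLowerCase(Locale.ROOT);
        int max = Math.max(x.length(), y.length());
        if (max == 0) {
            return 1.0;
        }
        return 1.0 - (double) levenshtein(x, y) / max;
    }

    public static void main(String[] args) {
        double threshold = 0.7; // hypothetical cut-off, to be tuned
        String nerResult = "Kabul";
        String candidate = "Kabol"; // e.g. a transliteration variant from the gazetteer
        double score = similarity(nerResult, candidate);
        System.out.println(score + (score >= threshold ? " keep" : " drop"));
    }
}
```

Candidates returned by the looser database search would be scored against the NER string and dropped when they fall below the threshold.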

4. Provide a "solution wrapper" for the Geotagging capability.

In order to make the GeoTagging a bit more functional "out of the box", I
was thinking of creating a class on which one calls find(MaxentModel, doc,
sentencedetector, EntityLinkerProperties) to abstract the current impl. I
know this is not standard practice; I just want to see what you all think.
This would make it easier to get this thing running.

