opennlp-dev mailing list archives

From Mark G <ma...@apache.org>
Subject Re: request for Input or ideas.... EntityLinker tickets
Date Sat, 26 Oct 2013 16:08:03 GMT
I am looking at the EntityLinker interface, and I would like to add the
method below (one which I think was proposed very early on). It allows an
entire document's worth of NEs to be processed in one call. Currently, if a
scoring routine needs all the results from the entire document, the scorer
cannot be called from within the EntityLinker impl. With the method below, a
user performs all NER as normal for an entire doc, then passes all of that
info into this method. I realized this when writing the scoring algorithms
for the GeoEntityLinker: some require all the hits for the doc and some
don't, so I was running some scorers internally and some afterwards, which
got messy and confusing. This would also allow for better pipeline
integration: no scorers would have to be chained after the EntityLinking,
since it would all happen within.

Thoughts?

like this:
  public List<LinkedSpan> find(String doctext, Span[] sentences,
      String[][] tokens, Span[][] names) {
    List<LinkedSpan> spans = new ArrayList<LinkedSpan>();
    for (int s = 0; s < sentences.length; s++) {
      for (String name : Span.spansToStrings(names[s], tokens[s])) {
        // do something
      }
    }
    return spans;
  }
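
To make the "it would all happen within" idea concrete, here is a minimal sketch of how document-level scorers could run inside such a method. It is only an illustration under assumptions: `Hit`, `DocScorer`, `FrequencyScorer` and `DocLevelLinker` are hypothetical stand-ins invented for this example (they are not OpenNLP classes), and the names are passed in as plain strings so the sketch runs without the OpenNLP jars.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a linked mention; NOT the OpenNLP LinkedSpan class. */
final class Hit {
    final int sentence;   // index of the sentence the name came from
    final String name;    // surface form of the named entity
    double score;         // filled in by the scorers
    Hit(int sentence, String name) { this.sentence = sentence; this.name = name; }
}

/** Hypothetical scorer contract: each scorer sees every hit in the document. */
interface DocScorer {
    void score(List<Hit> allHits);
}

/** Example scorer that needs the whole document: share of hits with the same name. */
final class FrequencyScorer implements DocScorer {
    @Override
    public void score(List<Hit> allHits) {
        Map<String, Integer> counts = new HashMap<>();
        for (Hit h : allHits) counts.merge(h.name, 1, Integer::sum);
        for (Hit h : allHits) h.score = counts.get(h.name) / (double) allHits.size();
    }
}

/** Hypothetical linker: collects hits for the whole doc, then scores internally. */
final class DocLevelLinker {
    private final List<DocScorer> scorers;
    DocLevelLinker(List<DocScorer> scorers) { this.scorers = scorers; }

    /** names[s] holds the entity strings already extracted for sentence s. */
    List<Hit> find(String[][] names) {
        List<Hit> hits = new ArrayList<>();
        for (int s = 0; s < names.length; s++) {
            for (String name : names[s]) hits.add(new Hit(s, name));
        }
        // All scorers run here, so nothing has to be chained after the linker.
        for (DocScorer scorer : scorers) scorer.score(hits);
        return hits;
    }
}
```

Because every scorer receives the full list of hits, scorers that need the whole document and scorers that only look at one hit can share the same contract; the caller just does NER for the doc and makes a single call.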


On Wed, Oct 23, 2013 at 11:36 AM, Mark G <giaconiamark@gmail.com> wrote:

> Not sure if the in-memory approach will provide the equivalent of full text
> indexing... but worth a try. Another design pattern is to just install one
> DB and have all the nodes connect. I have done this with Postgres on a
> 40ish node Hadoop cluster. The queries against the db's full text index are
> not that expensive for MySQL; it's not a complex query, just a seek on the
> full text index. But, of course, it depends on how much concurrency it
> will get, which depends on how much data, nodes, and tasks you have...
> Generically, I think the right answer is to be able to configure the
> connection behind the GeoEntityLinker... in mem || remote db || localhost db
>
>
>
> On Wed, Oct 23, 2013 at 8:46 AM, Jörn Kottmann <kottmann@gmail.com> wrote:
>
>> On 10/23/2013 01:14 PM, Mark G wrote:
>>
>>> All that being said, it is totally possible to run an in-memory version
>>> of the gazetteer. Personally, I like the DB approach; it provides a lot
>>> of flexibility and power.
>>>
>>
>> Yes, and you can even use a DB to run in-memory, which works with the
>> current implementation. I think I will experiment with that.
>>
>> I don't really mind using 3 GB memory for it, since my Hadoop servers
>> have more than enough anyway, and it makes the deployment easier (don't
>> have to deal with installing MySQL databases and keeping them in sync).
>>
>> Jörn
>>
>
>
