opennlp-dev mailing list archives

From Jörn Kottmann <kottm...@gmail.com>
Subject Re: OPENNLP-579
Date Thu, 30 May 2013 20:54:43 GMT
On 05/30/2013 10:19 PM, William Colen wrote:
> I could not understand what you mean by using token offsets for the
> sentences.

With the current approach in OpenNLP, sentence detection is done before
tokenization, and both components output Spans which refer to character
offsets.

But if you do tokenization first, the sentence detector could output
Spans which mark the tokens in a sentence (like the name finder does
with name Spans). This makes it possible to use a sentence Span directly
to access the tokens of a sentence. That said, that's also easy with the
current approach.
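
To make the difference concrete, here is a minimal sketch of both ways to get at the tokens of a sentence. The Span class below is a small stand-in defined only for this example (start inclusive, end exclusive), and the class and method names are made up for illustration; none of this is existing OpenNLP code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TokenOffsetSentences {

    // Stand-in span type for this sketch: start inclusive, end exclusive.
    static final class Span {
        final int start;
        final int end;

        Span(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    // Token-offset variant: the sentence span indexes directly into the token array.
    static String[] tokensByTokenOffsets(String[] tokens, Span sentence) {
        return Arrays.copyOfRange(tokens, sentence.start, sentence.end);
    }

    // Character-offset variant: keep the tokens whose character span lies inside
    // the character span of the sentence.
    static List<String> tokensByCharOffsets(String text, Span[] tokenSpans, Span sentence) {
        List<String> result = new ArrayList<>();
        for (Span token : tokenSpans) {
            if (token.start >= sentence.start && token.end <= sentence.end) {
                result.add(text.substring(token.start, token.end));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "Hello world. How are you?";

        // Character offsets of the tokens and of the second sentence.
        Span[] tokenSpans = {
            new Span(0, 5), new Span(6, 11), new Span(11, 12),
            new Span(13, 16), new Span(17, 20), new Span(21, 24), new Span(24, 25)
        };
        Span secondSentenceChars = new Span(13, 25);
        System.out.println(tokensByCharOffsets(text, tokenSpans, secondSentenceChars));

        // The same sentence expressed in token offsets.
        String[] tokens = {"Hello", "world", ".", "How", "are", "you", "?"};
        Span secondSentenceTokens = new Span(3, 7);
        System.out.println(Arrays.toString(tokensByTokenOffsets(tokens, secondSentenceTokens)));
    }
}
```

Both variants yield the same tokens; the token-offset one just avoids the containment check.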

If you now go one step further, a DocumentNameFinder's find method could be:
Span[] find(String text, Span tokens[], Span sentences[])
or
Span[] find(String tokens[], Span sentences[])

In both cases sentences would contain Spans with token offsets.
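
As a rough sketch of how the second signature might be used: the interface below only mirrors the proposal above, and the dictionary-based implementation is a toy made up for this example, not a real name finder. Span is again a local stand-in type with an exclusive end.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class DocumentNameFinderSketch {

    // Stand-in span type for this sketch: start inclusive, end exclusive.
    static final class Span {
        final int start;
        final int end;

        Span(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    // Hypothetical interface following the second proposed signature.
    interface DocumentNameFinder {
        // sentences hold token offsets into tokens; the returned name spans do too.
        Span[] find(String[] tokens, Span[] sentences);
    }

    // Toy implementation: a token is a "name" if it appears in a fixed set.
    static final class DictionaryNameFinder implements DocumentNameFinder {
        private final Set<String> names;

        DictionaryNameFinder(Set<String> names) {
            this.names = names;
        }

        @Override
        public Span[] find(String[] tokens, Span[] sentences) {
            List<Span> found = new ArrayList<>();
            // Walk the document sentence by sentence using the token offsets.
            for (Span sentence : sentences) {
                for (int i = sentence.start; i < sentence.end; i++) {
                    if (names.contains(tokens[i])) {
                        found.add(new Span(i, i + 1));
                    }
                }
            }
            return found.toArray(new Span[0]);
        }
    }

    public static void main(String[] args) {
        String[] tokens = {"Anna", "wrote", ".", "Then", "Bob", "replied", "."};
        Span[] sentences = { new Span(0, 3), new Span(3, 7) };

        DocumentNameFinder finder = new DictionaryNameFinder(Set.of("Anna", "Bob"));
        for (Span name : finder.find(tokens, sentences)) {
            System.out.println(tokens[name.start] + " [" + name.start + ", " + name.end + ")");
        }
    }
}
```

Because the sentence Spans already index into the token array, the finder can iterate per sentence without mapping character offsets back to tokens.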

Jörn
