ctakes-dev mailing list archives

From Jörn Kottmann <kottm...@gmail.com>
Subject Re: sentence detector newline behavior
Date Sat, 25 Jan 2014 17:23:55 GMT
On 01/25/2014 03:03 PM, Miller, Timothy wrote:
> I'm running into one issue, it gets tripped up on sentences with
> line-ending spaces.  I could easily remove them with a script but by
> default they are in there. It happens when a sentence example ends:
>
> ...BILAT HEMATOMAS.  <LF>
>
> (There is a period, then 2 spaces, then the line feed character.) I am
> pretty sure this is the root because when I fix this example to be .<LF>
> it gets tripped up in another place instead (with the same error). The
> specific error I get is this:
>

What probably happens here is that two sentences are detected: the
detector wants to split on the period and again on the <LF>.

The sentence detector classifies every end-of-sentence (eos) character
as either a split point or not. The user, on the other hand, expects to
get one span (with begin and end offsets) per sentence. The code which
computes the spans tries to strip leading and trailing white space from
each of them.

Removing the white space from a whitespace-only sentence is what causes
the exception you are seeing.
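
A minimal sketch of this failure mode and one way to guard against it,
written in Python for brevity. This is not the detector's actual code:
the function names (trim_span, sentence_spans) and the split offsets are
hypothetical, chosen only to mirror the example above.

```python
def trim_span(text, begin, end):
    """Shrink [begin, end) so it has no leading or trailing whitespace."""
    while begin < end and text[begin].isspace():
        begin += 1
    while end > begin and text[end - 1].isspace():
        end -= 1
    return begin, end


def sentence_spans(text, split_points):
    """Turn split offsets into trimmed spans, skipping whitespace-only pieces."""
    spans = []
    start = 0
    for stop in list(split_points) + [len(text)]:
        begin, end = trim_span(text, start, stop)
        # A whitespace-only piece collapses to an empty span (begin == end).
        # Skipping it here avoids building a span with no content.
        if begin < end:
            spans.append((begin, end))
        start = stop
    return spans


text = "BILAT HEMATOMAS.  \nNext line."
# Assumed split offsets: just after the period (16) and just after the LF (19).
splits = [16, 19]
print(sentence_spans(text, splits))
```

With the guard in place, the piece between the two splits (two spaces
plus the LF) is dropped instead of being trimmed into an invalid span.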

Which response would you expect from the sentence detector? Should a
whitespace-only sentence be returned?

And in case a sentence is terminated by a new line, should the new line
char be included in the sentence span or not?

Jörn
