uima-user mailing list archives

From Jens Grivolla <j+...@grivolla.net>
Subject CR+LF = 1 character?
Date Wed, 20 Apr 2011 08:58:04 GMT

While working on the integration between UIMA and a different text 
annotation system we ran into problems with differing offsets between 
the two systems.

As it turns out, the other system considers CR+LF (Windows-style line 
endings) to be two characters, while UIMA sees it as one.  Clearly, 
CR+LF is two bytes in one-byte-per-character encodings (ASCII, Latin-1, 
...), so all systems based on those encodings will see it as two 
characters, and I believe it is also represented as two Unicode characters.
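
To make the counting concrete, here is a minimal Python sketch. It only 
illustrates Python's behaviour and says nothing about what UIMA does 
internally; the helper names code_points and read_text are made up for 
this example. It shows that CR+LF is two code points in a string, but 
that a text-mode reader with newline translation can silently turn it 
into a single LF before the application ever sees it.

```python
import io


def code_points(text):
    """Return the Unicode code points of text in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]


def read_text(raw_bytes, newline):
    """Decode UTF-8 bytes through a text-mode wrapper.

    newline=None enables universal newline translation (CR+LF -> LF),
    newline="" returns line endings untranslated.
    """
    with io.TextIOWrapper(io.BytesIO(raw_bytes), encoding="utf-8",
                          newline=newline) as reader:
        return reader.read()


if __name__ == "__main__":
    print(code_points("\r\n"))            # ['U+000D', 'U+000A']
    raw = b"ab\r\ncd"
    print(len(read_text(raw, None)))      # 5: CR+LF was translated to LF
    print(len(read_text(raw, "")))        # 6: CR+LF kept as two characters
```

So the same file can yield offsets that differ by one per line break, 
depending only on how it was opened.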

In a way it makes sense to consider a "newline" as one character, 
independently of how it is represented, so I think the UIMA way is fine. 
But is there an overview somewhere of how different systems and 
programming languages handle this, e.g. when extracting substrings?

Given how messy this can get, it's probably best to normalize all text 
at the beginning, so that we only deal with Unicode strings with LF line 
endings, encoded as UTF-8 when writing to disk or otherwise serializing 
the data.
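
A rough sketch of that normalization step in Python, under a couple of 
assumptions: the function names normalize_newlines and write_normalized 
are invented here, and treating a lone CR as a line break as well is my 
own choice rather than something required by any of the systems 
mentioned.

```python
def normalize_newlines(text):
    """Convert CR+LF and lone CR line endings to LF.

    CR+LF is replaced first so that it becomes one LF, not two.
    Handling lone CR is an assumption made for this sketch.
    """
    return text.replace("\r\n", "\n").replace("\r", "\n")


def write_normalized(path, text):
    """Write text as UTF-8 with LF line endings on every platform.

    newline="\n" stops Python from translating LF to the platform's
    native line ending when writing.
    """
    with open(path, "w", encoding="utf-8", newline="\n") as out:
        out.write(normalize_newlines(text))
```

Offsets computed on the normalized string are then stable across 
systems, as long as every system is handed the normalized text and not 
the original file.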

It would still be interesting to know how painful this can get when not 
normalizing and, e.g., passing data between UIMA (Java), NLTK (Python), 
our own C#-based system, etc.
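
For the non-normalized case, one possible workaround is to keep a table 
that maps offsets in the LF-normalized text back to offsets in the 
original CR+LF text. The sketch below is only an illustration in Python; 
normalized_to_raw_offsets is a hypothetical helper, not part of UIMA or 
NLTK.

```python
def normalized_to_raw_offsets(raw_text):
    """Map each offset in the LF-normalized text to its raw offset.

    Entry i is the position in raw_text of the character that ends up
    at position i after every CR+LF is collapsed to one LF. A final
    entry for the end of the text is appended so that end offsets of
    spans can be mapped too.
    """
    mapping = []
    i = 0
    while i < len(raw_text):
        mapping.append(i)
        if raw_text.startswith("\r\n", i):
            i += 2  # CR+LF counts as one normalized character
        else:
            i += 1
    mapping.append(len(raw_text))
    return mapping


if __name__ == "__main__":
    raw = "Hello\r\nWorld\r\nfoo"
    normalized = raw.replace("\r\n", "\n")
    to_raw = normalized_to_raw_offsets(raw)

    begin = normalized.index("World")   # 6 in the normalized text
    end = begin + len("World")          # 11
    print(to_raw[begin], to_raw[end])   # 7 12
    print(raw[to_raw[begin]:to_raw[end]])  # World
```

The drift grows by one for every CR+LF that precedes an annotation, so 
annotations near the end of a long document can be off by hundreds of 
characters if the offsets are exchanged without such a mapping.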

