incubator-opennlp-dev mailing list archives

From Jason Baldridge <>
Subject universal tagset
Date Thu, 14 Apr 2011 03:15:19 GMT
For many applications, it would be useful to have a universal tagset for any
language you are working with. See below for details on a project that
provides mappings from many standard treebanks to a coarse-grained tagset
(12 tags). We might want to support these mappings to simple tags in our
models (e.g. have a model that uses corpus-native tags and another that uses
universal tags).
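
To make the idea concrete, here is a minimal sketch of what applying such a
mapping could look like. It is only an illustration: the table below is a
small hand-written subset for Penn Treebank tags, not the project's actual
mapping file, and the names PTB_TO_UNIVERSAL and to_universal are hypothetical.
Falling back to X for unmapped tags is likewise an assumption.

```python
# The 12 coarse-grained universal tags.
UNIVERSAL_TAGS = {
    "NOUN", "VERB", "ADJ", "ADV", "PRON", "DET",
    "ADP", "NUM", "CONJ", "PRT", ".", "X",
}

# Illustrative subset only: a few Penn Treebank tags and their universal tags.
PTB_TO_UNIVERSAL = {
    "NN": "NOUN", "NNS": "NOUN", "NNP": "NOUN",
    "VB": "VERB", "VBD": "VERB", "VBZ": "VERB",
    "JJ": "ADJ", "RB": "ADV", "PRP": "PRON", "DT": "DET",
    "IN": "ADP", "CD": "NUM", "CC": "CONJ", "RP": "PRT",
    ".": ".",
}


def to_universal(tagged, mapping=PTB_TO_UNIVERSAL, fallback="X"):
    """Replace each corpus-native tag with its universal tag.

    Tags missing from the mapping fall back to `fallback` (an assumption
    made for this sketch).
    """
    return [(word, mapping.get(tag, fallback)) for word, tag in tagged]


sentence = [("The", "DT"), ("dog", "NN"), ("barks", "VBZ"), (".", ".")]
print(to_universal(sentence))
```

A model using universal tags could then be trained on the converted corpus,
alongside a model trained on the corpus-native tags.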


Jason Baldridge
Assistant Professor, Department of Linguistics
The University of Texas at Austin

Hi everyone,

some of you have already heard about our universal part-of-speech
tagset (and are even using it); to others this might be new.

We sat down and read through the annotation guidelines of 25 treebanks
and created a mapping to a universal set of 12 coarse-grained
part-of-speech categories. We have described the tagset and
illustrated some use cases in a short write-up (see attached pdf).
Additionally, we have uploaded the mappings to a code repository with
version control so that new languages can be added or modifications can
be made if necessary:

The paper is for now on arXiv:

We hope that you will find this resource useful for your own work. Let
us know if you have any comments,
Dipanjan, Ryan & Slav
