incubator-opennlp-dev mailing list archives

From Jason Baldridge <jasonbaldri...@gmail.com>
Subject universal tagset
Date Thu, 14 Apr 2011 03:15:19 GMT
For many applications, it would be useful to have a universal tagset for any
language you are working with. See below for details on a project that
provides mappings from many standard treebanks to a coarse-grained tagset
(12 tags). We might want to support these mappings to simple tags in our
models (e.g. have a model that uses corpus-native tags and another that uses
universal tags).

Jason

-- 
Jason Baldridge
Assistant Professor, Department of Linguistics
The University of Texas at Austin
http://www.jasonbaldridge.com
http://twitter.com/jasonbaldridge


Hi everyone,

some of you have already heard about our universal part-of-speech
tagset (and are even using it); to others this might be new.

We sat down and read through the annotation guidelines of 25 treebanks
and created a mapping to a universal set of 12 coarse-grained
part-of-speech categories. We have described the tagset and
illustrated some use cases in a short write-up (see attached pdf).
Additionally, we have uploaded the mappings to a code repository with
version control so that new languages can be added or modifications can
be made if necessary:
http://code.google.com/p/universal-pos-tags/
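
To make the idea concrete, here is a minimal sketch of how such a mapping
could be applied to the output of a tagger that uses corpus-native tags. It
is not taken from the repository: the sample mapping is a hand-written
excerpt for Penn Treebank tags, the tab-separated "native<TAB>universal"
layout is an assumption about the file format, and the function names
parse_mapping and to_universal are hypothetical.

```python
# Illustrative only: SAMPLE_MAP is a hand-written excerpt, and the
# tab-separated "native<TAB>universal" layout is an assumed file format.
SAMPLE_MAP = "DT\tDET\nNN\tNOUN\nVBZ\tVERB\nJJ\tADJ\nIN\tADP\n.\t.\n"


def parse_mapping(text):
    """Parse mapping text into a dict from native tag to universal tag."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        native, universal = line.split("\t")
        mapping[native] = universal
    return mapping


def to_universal(tagged_tokens, mapping, unknown="X"):
    """Replace each native tag with its universal tag.

    Falling back to "X" (the catch-all category) for unmapped tags is an
    assumption of this sketch, not documented behaviour.
    """
    return [(token, mapping.get(tag, unknown)) for token, tag in tagged_tokens]


mapping = parse_mapping(SAMPLE_MAP)
sentence = [("The", "DT"), ("dog", "NN"), ("barks", "VBZ"), (".", ".")]
universal = to_universal(sentence, mapping)
```

Because the conversion is a plain lookup, it could run either on the
training data (to train a model on universal tags directly) or on the
output of a model trained with corpus-native tags.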

The paper is for now on arXiv:
http://arxiv.org/abs/1104.2086

We hope that you will find this resource useful for your own work. Let
us know if you have any comments.
Cheers,
Dipanjan, Ryan & Slav
