Date: Wed, 13 Apr 2011 22:15:19 -0500
Subject: universal tagset
From: Jason Baldridge
Reply-To: jbaldrid@mail.utexas.edu
To: opennlp-dev@incubator.apache.org

For many applications, it would be useful to have a universal tagset for any language you are working with. See below for details on a project that provides mappings from many standard treebanks to a coarse-grained tagset (12 tags).

We might want to support these mappings to simple tags in our models (e.g. have one model that uses corpus-native tags and another that uses universal tags).

Jason

--
Jason Baldridge
Assistant Professor, Department of Linguistics
The University of Texas at Austin
http://www.jasonbaldridge.com
http://twitter.com/jasonbaldridge

Hi everyone,

some of you have already heard about our universal part-of-speech tagset (and are even using it); to others this might be new. We sat down and read through the annotation guidelines of 25 treebanks and created a mapping to a universal set of 12 coarse-grained part-of-speech categories. We have described the tagset and illustrated some use cases in a short write-up (see attached PDF).
Additionally, we have uploaded the mappings to a code repository with version control so that new languages can be added or modifications can be made if necessary:

http://code.google.com/p/universal-pos-tags/

For now, the paper is on arXiv:

http://arxiv.org/abs/1104.2086

We hope that you will find this resource useful for your own work. Let us know if you have any comments.

Cheers,
Dipanjan, Ryan & Slav
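
To make the idea of mapping corpus-native tags to the 12 universal tags concrete, here is a minimal sketch in Python. It is not the project's code and not an OpenNLP API. The names `PTB_TO_UNIVERSAL`, `parse_mapping`, and `to_universal` are hypothetical, the Penn Treebank entries are only an illustrative subset that should be checked against the project's own mapping files, the two-column "native-tag universal-tag" text format is an assumption, and falling back to `X` for unmapped tags is a choice made here for illustration.

```python
# Illustrative subset of a Penn Treebank -> universal tag mapping.
# Assumption: verify these entries against the project's mapping files.
PTB_TO_UNIVERSAL = {
    "NN": "NOUN", "NNS": "NOUN", "NNP": "NOUN",
    "VB": "VERB", "VBD": "VERB", "VBZ": "VERB",
    "JJ": "ADJ", "RB": "ADV", "PRP": "PRON",
    "DT": "DET", "IN": "ADP", "CD": "NUM",
    "CC": "CONJ", "RP": "PRT", ".": ".",
}


def parse_mapping(text):
    """Parse lines of the assumed form 'native-tag universal-tag'."""
    mapping = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:  # skip blank or malformed lines
            native, universal = parts
            mapping[native] = universal
    return mapping


def to_universal(tagged_tokens, mapping, unknown="X"):
    """Replace each native tag with its universal tag.

    Tags missing from the mapping fall back to `unknown`
    (an assumption of this sketch, not a rule from the project).
    """
    return [(word, mapping.get(tag, unknown)) for word, tag in tagged_tokens]


sentence = [("The", "DT"), ("dog", "NN"), ("barked", "VBD"), (".", ".")]
universal = to_universal(sentence, PTB_TO_UNIVERSAL)
print(universal)
```

A model using universal tags could then be trained on the converted corpus, while a second model keeps the corpus-native tags, as suggested at the top of this message.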