incubator-ctakes-commits mailing list archives

Subject svn commit: r838531 - in /websites/staging/ctakes/trunk/content: ./ ctakes/2.6.0/ctakes-2.6-Chunker.html
Date Thu, 15 Nov 2012 22:34:53 GMT
Author: buildbot
Date: Thu Nov 15 22:34:52 2012
New Revision: 838531

Staging update by buildbot for ctakes

    websites/staging/ctakes/trunk/content/   (props changed)

Propchange: websites/staging/ctakes/trunk/content/
--- cms:source-revision (original)
+++ cms:source-revision Thu Nov 15 22:34:52 2012
@@ -1 +1 @@

Added: websites/staging/ctakes/trunk/content/ctakes/2.6.0/ctakes-2.6-Chunker.html
--- websites/staging/ctakes/trunk/content/ctakes/2.6.0/ctakes-2.6-Chunker.html (added)
+++ websites/staging/ctakes/trunk/content/ctakes/2.6.0/ctakes-2.6-Chunker.html Thu Nov 15
22:34:52 2012
@@ -0,0 +1,258 @@
+<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "">
+    Licensed to the Apache Software Foundation (ASF) under one or more
+    contributor license agreements.  See the NOTICE file distributed with
+    this work for additional information regarding copyright ownership.
+    The ASF licenses this file to You under the Apache License, Version 2.0
+    (the "License"); you may not use this file except in compliance with
+    the License.  You may obtain a copy of the License at
+        http://www.apache.org/licenses/LICENSE-2.0
+    Unless required by applicable law or agreed to in writing, software
+    distributed under the License is distributed on an "AS IS" BASIS,
+    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+    See the License for the specific language governing permissions and
+    limitations under the License.
+<link href="/ctakes/css/ctakes.css" rel="stylesheet" type="text/css">
+<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
+ <div class="banner">
+      <div id="bannerleft">
+		<a href=""><img src=""
alt="The Apache Software Foundation" border="0"/></a>
+	<br/>
+			<img alt="cTAKES logo" src="/ctakes/images/ctakes_logo.jpg" border="0"/>
+      </div>  
+    <div id="bannerright">	
+	      <img id="asf-logo" alt="Apache Incubator" src="" />
+	  </div>
+ </div>  
+  <div id="clear"></div>
+  <div id="sidenav">
+    <h1 id="general">General</h1>
+<li><a href="/ctakes/index.html">About</a></li>
+<li><a href="/ctakes/gettingstarted.html">Getting Started</a></li>
+<li><a href="/ctakes/downloads.html">Downloads</a></li>
+<li><a href="/ctakes/glossary.html">Glossary</a></li>
+<h1 id="community">Community</h1>
+<li><a href="/ctakes/get-involved.html">Get Involved</a></li>
+<li><a href="">Bug Tracker</a></li>
+<li><a href="/ctakes/mailing-lists.html">Mailing Lists</a></li>
+<li><a href="/ctakes/people.html">People</a></li>
+<li><a href="">Incubator page</a></li>
+<li><a href="/ctakes/license.html">License</a></li>
+<li><a href="/ctakes/history.html">History</a></li>
+<li><a href="/ctakes/community-faqs.html">Community FAQs</a></li>
+<h1 id="users">Users</h1>
+<li><a href="/ctakes/userguide.html">User Guide</a></li>
+<li><a href="/ctakes/user-faqs.html">User FAQs</a></li>
+<h1 id="developers">Developers</h1>
+<li><a href="/ctakes/developerguide.html">Developer Guide</a></li>
+<li><a href="/ctakes/developer-faqs.html">Developer FAQs</a></li>
+<h1 id="ppmc">PPMC</h1>
+<li><a href="/ctakes/ppmc-faqs.html">PPMC FAQs</a></li>
+<li><a href="/ctakes/ctakes-release-guide.html">Release Guide</a></li>
+<h1 id="asf">ASF</h1>
+<li><a href="">Apache Software Foundation</a></li>
+<li><a href="">Thanks</a></li>
+<li><a href="">Become a Sponsor</a></li>
+  </div>
+  <div id="contenta">
+    <h1 id="ctakes-26-chunker">cTAKES 2.6 - Chunker</h1>
+<h2 id="overview-of-chunker">Overview of Chunker</h2>
+<p>In cTAKES when we refer to a "chunker" we often mean a shallow parser, i.e. a
+component that tags noun phrases, verb phrases, etc.</p>
+<p>This project supports three tasks:</p>
+<li>Building a model from training data;</li>
+<li>Tagging text, using a trained model;</li>
+<li>Adjusting the end offset of certain chunks so they envelop other chunks, for certain
patterns of chunks.</li>
+<p>This project provides a UIMA wrapper around the popular OpenNLP chunker. The
+UIMA examples project provides default wrappers for several of the components
+in OpenNLP, but not for the chunker. We have borrowed from the UIMA examples
+project liberally. Our wrapper works with our type system. Additionally, we
+added features and supporting components.</p>
+<p>A chunker model is included with this project.</p>
+<p><img alt="" src="/images/icons/emoticons/information.png" /></p>
+<p>The model derives from a combination of GENIA, Penn Treebank (Wall Street
+Journal) and anonymized clinical data per Safe Harbor HIPAA guidelines. Prior
+to model building the clinical data was deidentified for patient names to
+preserve patient confidentiality. Any person name in the model will originate
+from non-patient data sources.</p>
+<h2 id="building-a-model-prepare-genia-training-data">Building a model - Prepare GENIA
training data</h2>
+<p>You need to download a copy of GENIA's Treebank corpus from
+<a href=""></a>.
+The version we used is called "beta". This version is distributed in a set of two
+files, one dated Sept. 22, 2004, with 200 "abstracts", and the other July 11,
+2005, with 300 "abstracts". Please download both. After extraction, place all
+the .tree files from the two downloads into one directory, which we'll refer to
+<p>Please also download <a href="">chunklink
from</a>. The version
+we used is This tool, from the <a href="">Induction of
+Linguistic Knowledge (ILK)</a> group of Tilburg University,
+The Netherlands, converts Penn Treebank II files into a one-word-per-line
+<p>Next, we'll use data.chunk.genia.Genia2PTB to convert the GENIA Treebank corpus to
+Penn Treebank II format, then use chunklink to convert to chunk data, and
+finally use data.chunk.Chunklink2OpenNLP to convert to OpenNLP format.</p>
+<p><img alt="" src="/images/icons/emoticons/information.png" /></p>
+<p>This Java class a) renames the .tree files to names that look like
+wsj_0001.mrg, puts them in the directory structure expected by chunklink, and
+creates a mapping of the new names to the original names; b) reformats the
+POS tags; c) adds an extra set of parentheses to each line
+of the data.</p>
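+<p>As a rough illustration of steps a) and c), the sketch below generates wsj-style names, records the new-to-original name mapping, and wraps a line in an extra set of parentheses. It is not the actual data.chunk.genia.Genia2PTB source: the class name, the sequential numbering starting at 1, and the exact spacing of the added parentheses are assumptions made for the example, and step b) is left out.</p>

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch only; not the real Genia2PTB implementation.
final class Genia2PtbSketch {

    // Builds a wsj-style file name for the n-th file (assumed numbering from 1).
    static String wsjName(int index) {
        return String.format("wsj_%04d.mrg", index);
    }

    // Maps each new wsj-style name to the original .tree file name.
    static Map<String, String> buildMapping(List<String> treeFiles) {
        Map<String, String> mapping = new LinkedHashMap<>();
        for (int i = 0; i < treeFiles.size(); i++) {
            mapping.put(wsjName(i + 1), treeFiles.get(i));
        }
        return mapping;
    }

    // Step c): wraps one line of tree data in an extra set of parentheses.
    // The spacing used here is an assumption.
    static String wrap(String line) {
        return "( " + line + " )";
    }

    public static void main(String[] args) {
        Map<String, String> mapping =
                buildMapping(List.of("93123257.tree", "93172387.tree"));
        mapping.forEach((newName, oldName) ->
                System.out.println(newName + "\t" + oldName));
        System.out.println(wrap("(S (NP (NN example)))"));
    }
}
```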
+<li>Run data.chunk.genia.Genia2PTB:</li>
+<p><strong>java -cp</strong> <strong><em>&lt;classpath&gt;</em></strong>
<strong>data.chunk.genia.Genia2PTB</strong> <strong><em>&lt;genia-trees&gt;</em></strong>
<strong><em>&lt;ptb-trees&gt;</em></strong><br />
+<p><strong><em>&lt;genia-trees&gt;</em></strong> is
the directory which holds the GENIA corpus files;<br />
+<strong><em>&lt;ptb-trees&gt;</em></strong> is the directory
where the converted PTB trees will be written;<br />
a file that will be created by Genia2PTB to save file name mappings.</p>
+<p><img alt="" src="/images/icons/emoticons/check.png" /></p>
+<p><strong>Tip</strong><br />
+<p>There are a number of <strong>problematic sentences</strong> in the
second set of 300
+treebanked abstracts (in &lt;ptb-trees&gt; after processing by
+data.chunk.genia.Genia2PTB) that caused the chunklink script to fail. We
+removed them when building our model. The original GENIA file names are listed
+below for your reference. You need to remove the lines from the output of
+Genia2PTB. To find out the converted file names, please look at &lt;genia-ptb-
+<p>Line numbers are separated by commas.</p>
+<li>93123257.tree - 6</li>
+<li>93172387.tree - 3</li>
+<li>93186809.tree - 5</li>
+<li>93280865.tree - 7</li>
+<li>94085904.tree - 6</li>
+<li>94193110.tree - 2</li>
+<li>96247631.tree - 3, 5</li>
+<li>96353916.tree - 10</li>
+<li>96357043.tree - 4</li>
+<li>97031819.tree - 3, 4</li>
+<li>97054651.tree - 7</li>
+<li>97074532.tree - 6, 7</li>
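+<p>Removing the listed lines by hand is error-prone, so a small helper can do it. The sketch below is only an illustration, not part of cTAKES: the class and method names are hypothetical, and it assumes that the line numbers above are 1-based and that each entry refers to whole lines of the converted file.</p>

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical helper; assumes 1-based line numbers.
final class ProblemLineFilter {

    // Returns a copy of the lines with the given 1-based line numbers dropped.
    static List<String> dropLines(List<String> lines, Set<Integer> toDrop) {
        List<String> kept = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            if (!toDrop.contains(i + 1)) {
                kept.add(lines.get(i));
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Stand-in content for the file converted from 96247631.tree,
        // for which lines 3 and 5 are listed above.
        List<String> lines = List.of("s1", "s2", "s3", "s4", "s5", "s6");
        System.out.println(dropLines(lines, Set.of(3, 5)));
    }
}
```

+<p>In practice the lines would be read from and written back to the converted file, whose name has to be looked up in the mapping file written by Genia2PTB.</p>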
+<li>Run chunklink:</li>
+<p><strong>perl -NHhftc</strong> <strong><em>&lt;ptb-trees&gt;</em></strong><strong>/wsj_????.mrg
&gt;</strong> <strong><em>&lt;chunklink-chunks&gt;</em></strong><br />
+<strong><em>&lt;chunklink-chunks&gt;</em></strong> is the redirected standard output from chunklink.<br />
+<p><img alt="" src="/images/icons/emoticons/information.png" /></p>
+<p>The chunklink script does not seem to work on Windows, but we did manage to run
+it in a Cygwin session.</p>
+<li>Run data.chunk.Chunklink2OpenNLP</li>
+<p><strong>java -cp</strong> <strong><em>&lt;classpath&gt;</em></strong>
<strong>data.chunk.Chunklink2OpenNLP</strong> <strong><em>&lt;chunklink-chunks&gt;
&lt;training-data&gt;</em></strong><br />
+<strong><em>&lt;chunklink-chunks&gt;</em></strong> is the output of chunklink from the previous step.<br />
+<strong><em>&lt;training-data&gt;</em></strong> is the resulting
training data file.</p>
+<li>Prepare Penn Treebank training data</li>
+<p>Please refer to the section called <a href="">Obtaining
training data in the cTAKES
+documentation on
+SourceForge</a> on <a href="">how to
+obtain the Penn Treebank corpus</a>.</p>
+<p>Preparing Penn Treebank data is similar to preparing GENIA data, as described
+in the section called <a href="">Prepare
GENIA training data in the cTAKES documentation
+on SourceForge</a>,
+except that the first step is not necessary.</p>
+<li>Run chunklink:</li>
+<p><strong>perl -NHhftc</strong> <strong><em>&lt;ptb-corpus&gt;</em></strong>
<strong>/wsj_????.mrg &gt;</strong> <strong><em>&lt;chunklink-chunks&gt;</em></strong><br />
+<strong><em>&lt;ptb-corpus&gt;</em></strong> is your Penn
Treebank corpus directory.<br />
+<strong><em>&lt;chunklink-chunks&gt;</em></strong> is the redirected
standard output.</p>
+<li>Run Chunklink2OpenNLP</li>
+<p><strong>java -cp</strong> <strong><em>&lt;classpath&gt;</em></strong>
<strong>data.chunk.Chunklink2OpenNLP</strong> <strong><em>&lt;chunklink-chunks&gt;
&lt;training-data&gt;</em></strong><br />
+<strong><em>&lt;chunklink-chunks&gt;</em></strong> is the output of chunklink from the previous step.<br />
+<strong><em>&lt;training-data&gt;</em></strong> is the resulting
training data file.<br />
+<strong>Build a model from your training data</strong><br />
+Building a chunker model is much easier than preparing the training data.
+After you have obtained training data, run the OpenNLP tool:</p>
+<p><strong>java -cp</strong> <strong><em>&lt;classpath&gt;</em></strong>
<strong></strong> <strong><em>&lt;training-data&gt;</em></strong>
<strong><em>&lt;model-name&gt;</em></strong> <strong><em>iterations</em></strong>
<strong><em>cutoff</em></strong><br />
+<strong><em>&lt;training-data&gt;</em></strong> is an OpenNLP training data file.<br />
+<strong><em>&lt;model-name&gt;</em></strong> is the file
name of the resulting model. The name should end with either .txt (for a plain text model)
or .bin.gz (for a compressed binary model).<br />
+<strong><em>iterations</em></strong> determines how many training
iterations will be performed. The default is 100.<br />
+<strong><em>cutoff</em></strong> determines the minimum number of
times a feature has to be seen to be considered for inclusion in the model. The default cutoff
is 5.<br />
+The iterations and cutoff arguments are optional as a pair: that is,
+you should provide both or provide neither.</p>
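+<p>The argument rules described above can be sketched as a small validation routine. This is a hypothetical wrapper written for illustration, not code from OpenNLP or cTAKES: the class name and error messages are assumptions, while the defaults (100 iterations, cutoff 5), the both-or-neither rule, and the expected model name endings come from the description above.</p>

```java
// Hypothetical helper; not part of OpenNLP or cTAKES.
final class TrainerArgs {
    static final int DEFAULT_ITERATIONS = 100;
    static final int DEFAULT_CUTOFF = 5;

    final String trainingData;
    final String modelName;
    final int iterations;
    final int cutoff;

    private TrainerArgs(String trainingData, String modelName,
                        int iterations, int cutoff) {
        this.trainingData = trainingData;
        this.modelName = modelName;
        this.iterations = iterations;
        this.cutoff = cutoff;
    }

    // Accepts either two arguments, or four (iterations and cutoff together).
    static TrainerArgs parse(String[] args) {
        if (args.length != 2 && args.length != 4) {
            throw new IllegalArgumentException(
                    "usage: <training-data> <model-name> [iterations cutoff]");
        }
        String modelName = args[1];
        if (!modelName.endsWith(".txt") && !modelName.endsWith(".bin.gz")) {
            throw new IllegalArgumentException(
                    "model name should end with .txt or .bin.gz");
        }
        int iterations = DEFAULT_ITERATIONS;
        int cutoff = DEFAULT_CUTOFF;
        if (args.length == 4) {
            iterations = Integer.parseInt(args[2]);
            cutoff = Integer.parseInt(args[3]);
        }
        return new TrainerArgs(args[0], modelName, iterations, cutoff);
    }

    public static void main(String[] args) {
        TrainerArgs parsed = parse(new String[] {"train.txt", "chunker.bin.gz"});
        System.out.println(parsed.iterations + " " + parsed.cutoff);
    }
}
```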
+<h2 id="analysis-engines-annotators">Analysis engines (annotators)</h2>
+<h3 id="chunkerxml">Chunker.xml</h3>
+<p>The file cTAKESdesc/chunkerdesc/analysis_engine/Chunker.xml provides a
+descriptor for the Chunker analysis engine which is the UIMA component we have
+written that wraps the OpenNLP chunker. It calls
+<strong>edu.mayo.bmi.uima.chunker.Chunker</strong>, whose Javadoc provides information
+on how to customize this descriptor.</p>
+<p><strong>Parameters</strong><br />
+<p>the file that contains the chunker tagging model</p>
+<p>the full class name of an implementation of the interface
+<h3 id="chunkeraggregatexml">ChunkerAggregate.xml</h3>
+<p>The file cTAKESdesc/chunkerdesc/analysis_engine/ChunkerAggregate.xml provides
+a descriptor that defines a pipeline for shallow parsing so that all the
+necessary inputs (e.g. tokens, sentences, and POS tags) have been added to the
+CAS. It inherits two parameters from
+<a href="">Chunker.xml</a> and
three from
+<a href="">POSTagger.xml</a>.</p>
+<li>Start UIMA CPE GUI.</li>
+<p><strong>java -cp</strong> <strong><em>&lt;classpath&gt;</em></strong>
+<li>Open this file.</li>
+<li>Set the parameters for the collection reader to point to a local collection of
files that you want shallow parsed.</li>
+<li>Set the parameters for the Chunker as appropriate for your environment.</li>
+<li>Set the output directory of the XCAS Writer CAS Consumer.</li>
+<p>The results of running the pipeline are written to the output directory as
+XCAS files. These files can be viewed in the CAS Visual Debugger.</p>
+  </div>
+ <div id="footera">
+    <div id="copyrighta">
+      <p>Copyright &#169; 2011 The Apache Software Foundation, Licensed under the
<a href="">Apache License, Version 2.0</a>.<br/>Apache
and the Apache feather logo are trademarks of The Apache Software Foundation.</p>
+    </div>
+ </div>
