incubator-ctakes-commits mailing list archives

From build...@apache.org
Subject svn commit: r838541 - in /websites/staging/ctakes/trunk/content: ./ ctakes/2.6.0/ctakes-2.6-Core.html
Date Thu, 15 Nov 2012 22:46:31 GMT
Author: buildbot
Date: Thu Nov 15 22:46:31 2012
New Revision: 838541

Log:
Staging update by buildbot for ctakes

Added:
    websites/staging/ctakes/trunk/content/ctakes/2.6.0/ctakes-2.6-Core.html
Modified:
    websites/staging/ctakes/trunk/content/   (props changed)

Propchange: websites/staging/ctakes/trunk/content/
------------------------------------------------------------------------------
--- cms:source-revision (original)
+++ cms:source-revision Thu Nov 15 22:46:31 2012
@@ -1 +1 @@
-1410078
+1410083

Added: websites/staging/ctakes/trunk/content/ctakes/2.6.0/ctakes-2.6-Core.html
==============================================================================
--- websites/staging/ctakes/trunk/content/ctakes/2.6.0/ctakes-2.6-Core.html (added)
+++ websites/staging/ctakes/trunk/content/ctakes/2.6.0/ctakes-2.6-Core.html Thu Nov 15 22:46:31 2012
@@ -0,0 +1,242 @@
+<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
+<html>
+<head>
+<!--
+ 
+    Licensed to the Apache Software Foundation (ASF) under one or more
+    contributor license agreements.  See the NOTICE file distributed with
+    this work for additional information regarding copyright ownership.
+    The ASF licenses this file to You under the Apache License, Version 2.0
+    (the "License"); you may not use this file except in compliance with
+    the License.  You may obtain a copy of the License at
+ 
+       http://www.apache.org/licenses/LICENSE-2.0
+ 
+    Unless required by applicable law or agreed to in writing, software
+    distributed under the License is distributed on an "AS IS" BASIS,
+    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+    See the License for the specific language governing permissions and
+    limitations under the License.
+-->
+
+<link href="/ctakes/css/ctakes.css" rel="stylesheet" type="text/css">
+
+<title>cTAKES 2.6 Core</title>
+<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
+
+</head>
+ 
+<body>
+ <div class="banner">
+      <div id="bannerleft">
+		<a href="http://www.apache.org/"><img src="http://www.apache.org/images/asf_logo_wide.gif"
alt="The Apache Software Foundation" border="0"/></a>
+	<br/>
+			<img alt="cTAKES logo" src="/ctakes/images/ctakes_logo.jpg" border="0"/>
+      </div>  
+    <div id="bannerright">	
+	      <img id="asf-logo" alt="Apache Incubator" src="http://incubator.apache.org/images/egg-logo.png"
border="0"/></a>			
+	  </div>
+ </div>  
+  <div id="clear"></div>
+
+
+  <div id="sidenav">
+    <h1 id="general">General</h1>
+<ul>
+<li><a href="/ctakes/index.html">About</a></li>
+<li><a href="/ctakes/gettingstarted.html">Getting Started</a></li>
+<li><a href="/ctakes/downloads.html">Downloads</a></li>
+<li><a href="/ctakes/glossary.html">Glossary</a></li>
+</ul>
+<h1 id="community">Community</h1>
+<ul>
+<li><a href="/ctakes/get-involved.html">Get Involved</a></li>
+<li><a href="https://issues.apache.org/jira/browse/ctakes">Bug Tracker</a></li>
+<li><a href="/ctakes/mailing-lists.html">Mailing Lists</a></li>
+<li><a href="/ctakes/people.html">People</a></li>
+<li><a href="http://incubator.apache.org/projects/ctakes.html">Incubator page</a></li>
+<li><a href="/ctakes/license.html">License</a></li>
+<li><a href="/ctakes/history.html">History</a></li>
+<li><a href="/ctakes/community-faqs.html">Community FAQs</a></li>
+</ul>
+<h1 id="users">Users</h1>
+<ul>
+<li><a href="/ctakes/userguide.html">User Guide</a></li>
+<li><a href="/ctakes/user-faqs.html">User FAQs</a></li>
+</ul>
+<h1 id="developers">Developers</h1>
+<ul>
+<li><a href="/ctakes/developerguide.html">Developer Guide</a></li>
+<li><a href="/ctakes/developer-faqs.html">Developer FAQs</a></li>
+</ul>
+<h1 id="ppmc">PPMC</h1>
+<ul>
+<li><a href="/ctakes/ppmc-faqs.html">PPMC FAQs</a></li>
+<li><a href="/ctakes/ctakes-release-guide.html">Release Guide</a> <br
/>
+</li>
+</ul>
+<h1 id="asf">ASF</h1>
+<ul>
+<li><a href="http://www.apache.org">Apache Software Foundation</a></li>
+<li><a href="http://www.apache.org/foundation/thanks.html">Thanks</a></li>
+<li><a href="http://www.apache.org/foundation/sponsorship.html">Become a Sponsor</a></li>
+</ul>
+  </div>
+  <div id="contenta">
+    <h1 id="ctakes-26-core">cTAKES 2.6 - Core</h1>
+<h2 id="overview-of-core">Overview of Core</h2>
+<p>This project contains several annotators, including:</p>
+<ul>
+<li>a sentence detector annotator</li>
+<li>a tokenizer</li>
+<li>an annotator that does not update the CAS in any way</li>
+<li>an annotator that creates a single Segment annotation encompassing the entire document
text</li>
+</ul>
+<p><img alt="" src="/images/icons/emoticons/information.png" /></p>
+<p>End-of-line characters are considered end-of-sentence markers. Hyphenated
+words that appear in the hyphenated words list with frequency values greater
+than the FreqCutoff will be considered one token. Refer to <a href="http://ohnlp.sourceforge.net/cTAKES/#tokenizer_annot">the tokenizer information on SourceForge.net</a>.</p>
+<p>A sentence detector model is included with this project.</p>
+<p><img alt="" src="/images/icons/emoticons/information.png" /></p>
+<p>The model derives from a combination of GENIA, Penn Treebank (Wall Street
+Journal) and clinical data anonymized per Safe Harbor HIPAA guidelines. Prior
+to model building, patient names in the clinical data were de-identified to
+preserve patient confidentiality. Any person name in the model will originate
+from non-patient data sources.</p>
+<h2 id="analysis-engines-annotators">Analysis engines (annotators)</h2>
+<h3 id="aggregateaexml">AggregateAE.xml</h3>
+<p>This descriptor is included for testing. This descriptor is typically not used
+in a more complete pipeline. One or more of the individual analysis engines is
+normally included.</p>
+<h3 id="copyannotatorxml">CopyAnnotator.xml</h3>
+<p>This is a utility annotator that copies data from an existing JCas object into
+a new JCas object.</p>
+<h3 id="nullannotatorxml">NullAnnotator.xml</h3>
+<p>As its name implies, this annotator does nothing. It can be useful if you are
+using the UIMA CPE GUI and you are required to choose an analysis engine but
+you don't actually want to use one.</p>
+<h3 id="overlapannotatorxml">OverlapAnnotator.xml</h3>
+<ul>
+<li>An annotator that, when two annotations overlap, modifies one annotation (its begin and end offsets) or deletes one (or both) of them. The action taken depends on the configuration parameters. It can extend an annotation to encompass overlapping annotations. It can also be configured to delete annotations of type A that are subsumed by other annotations of type A, if you only want the longest annotations of the given type to be kept.</li>
+<li>Refer to the <em>Javadoc</em> for <em>edu.mayo.bmi.uima.core.ae.OverlapAnnotator</em> for more details.</li>
+</ul>
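+<p>The "keep only the longest annotations of a type" behaviour can be pictured with a minimal plain-Java sketch. It is not the OverlapAnnotator implementation: the class name, the method name, the use of int[] {begin, end} pairs, and the choice to keep the first of two identical spans are all assumptions made for illustration.</p>

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: not the cTAKES OverlapAnnotator. */
public class LongestSpanSketch {

    /** Drops every span that is contained in a longer span (or in an earlier identical one). */
    public static List<int[]> keepLongest(List<int[]> spans) {
        List<int[]> kept = new ArrayList<>();
        for (int i = 0; i < spans.size(); i++) {
            int[] a = spans.get(i);
            boolean subsumed = false;
            for (int j = 0; j < spans.size() && !subsumed; j++) {
                int[] b = spans.get(j);
                boolean contains = b[0] <= a[0] && a[1] <= b[1];
                boolean strictlyLonger = (b[1] - b[0]) > (a[1] - a[0]);
                // Assumption of this sketch: of two identical spans, keep the first.
                boolean earlierDuplicate = j < i && b[0] == a[0] && b[1] == a[1];
                subsumed = contains && (strictlyLonger || earlierDuplicate);
            }
            if (!subsumed) {
                kept.add(a);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<int[]> spans = new ArrayList<>();
        spans.add(new int[] {0, 10});
        spans.add(new int[] {2, 5});   // inside [0, 10], so it is dropped
        spans.add(new int[] {8, 14});  // overlaps [0, 10] but is not contained in it
        for (int[] s : keepLongest(spans)) {
            System.out.println(s[0] + "-" + s[1]);
        }
    }
}
```

+<p>Spans that merely overlap are both kept here; extending or trimming offsets, which the real annotator can also do depending on its parameters, is left out of the sketch.</p>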
+<h3 id="sentencedetectorannotatorxml">SentenceDetectorAnnotator.xml</h3>
+<p>A wrapper around the <a href="http://opennlp.sourceforge.net/">OpenNLP</a>
sentence
+detector that creates Sentence annotations based on the location of end-of-
+line characters and on the output of the OpenNLP sentence detector. This
+annotator considers an end-of-line character as an end-of-sentence marker.
+Optionally it can skip certain sections of the document. See the section
+called <a href="http://ohnlp.sourceforge.net/cTAKES/#run_sentdetect_token_annot">Running
the sentence detector and
+tokenizer</a>
+for more details.</p>
+<p><strong>Parameters</strong><br />
+SegmentsToSkip</p>
+<p>(optional) the list of sections not to create Sentence annotations for.</p>
+<p><strong>Resources</strong><br />
+MaxentModelFile</p>
+<p>the Maxent model used by the sentence detector.</p>
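+<p>The end-of-line rule on its own can be sketched in plain Java as follows. This is only an illustration under stated assumptions: the class and method names are hypothetical, blank lines are skipped by choice, and the statistical OpenNLP step that would further split each line is deliberately omitted.</p>

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: shows the end-of-line rule, not the OpenNLP model. */
public class EolSentenceSketch {

    /** Returns {begin, end} offsets (end exclusive) of the non-blank lines in the text. */
    public static List<int[]> lineSpans(String text) {
        List<int[]> spans = new ArrayList<>();
        int begin = 0;
        for (int i = 0; i <= text.length(); i++) {
            boolean atEnd = i == text.length();
            if (atEnd || text.charAt(i) == '\n' || text.charAt(i) == '\r') {
                // Every end-of-line character closes the current candidate sentence.
                if (!text.substring(begin, i).trim().isEmpty()) {
                    spans.add(new int[] {begin, i});
                }
                begin = i + 1;
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        String text = "The boy ran.\nDid the girl\r\nrun too?";
        for (int[] s : lineSpans(text)) {
            System.out.println(s[0] + "-" + s[1] + ": " + text.substring(s[0], s[1]));
        }
    }
}
```

+<p>Note that "Did the girl" and "run too?" come out as separate spans: a line break inside a sentence still ends it under this rule.</p>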
+<h3 id="simplesegmentannotatorxml">SimpleSegmentAnnotator.xml</h3>
+<p>Creates a single Segment annotation encompassing the entire document. Use it
+before annotators that require a Segment annotation when the pipeline does
+not contain another annotator that creates Segment annotations. This
+annotator is intended for plain text files, which do not have section (aka
+segment) tags, but not for CDA documents, because the CdaCasInitializer annotator
+creates Segment annotations for them.</p>
+<p><strong>Parameters</strong><br />
+SegmentID</p>
+<p>(optional) the identifier to use for the Segment annotation created.</p>
+<h3 id="tokenizerannotatorxml">TokenizerAnnotator.xml</h3>
+<p>Tokenizes text according to Penn Treebank tokenization rules.  This is the
+default tokenizer for cTAKES as of cTAKES 2.0.</p>
+<p><strong>Parameters</strong><br />
+SegmentsToSkip</p>
+<p>(optional) the list of sections not to create token annotations for.</p>
+<h3 id="tokenizerannotatorversion1xml">TokenizerAnnotatorVersion1.xml</h3>
+<p>This is the original cTAKES tokenizer. Hyphenated words that appear in the
+hyphenated words list (HyphFreqFile) with frequency values greater than the
+FreqCutoff will be considered one token. See classes
+<em>edu.mayo.bmi.uima.core.ae.TokenizerAnnotator</em> and
+<em>edu.mayo.bmi.nlp.tokenizer.Tokenizer</em> for implementation details.</p>
+<p><strong>Parameters</strong><br />
+SegmentsToSkip</p>
+<p>(optional) the list of sections not to create token annotations for.</p>
+<p>FreqCutoff</p>
+<p>cutoff value: only entries in the hyphenated words list (HyphFreqFile) with a
+frequency greater than this value are treated as one token</p>
+<p><strong>Resources</strong><br />
+HyphFreqFile</p>
+<p>a file containing a list of hyphenated words and their frequency within some
+corpus.</p>
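+<p>A minimal plain-Java sketch of the FreqCutoff rule follows. It is not the cTAKES Tokenizer class: the class name, the method name, the in-memory map standing in for HyphFreqFile, and the sample frequencies are all hypothetical, and the on-disk format of the file is not modelled.</p>

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative only: not the cTAKES Tokenizer class. */
public class HyphenCutoffSketch {
    // Hypothetical in-memory stand-in for the hyphenated words list.
    private final Map<String, Integer> hyphFreq;
    private final int freqCutoff;

    public HyphenCutoffSketch(Map<String, Integer> hyphFreq, int freqCutoff) {
        this.hyphFreq = new HashMap<>(hyphFreq);
        this.freqCutoff = freqCutoff;
    }

    /** True when the word is listed with a frequency greater than the cutoff. */
    public boolean isSingleToken(String word) {
        Integer freq = hyphFreq.get(word);
        return freq != null && freq > freqCutoff;
    }

    public static void main(String[] args) {
        Map<String, Integer> list = new HashMap<>();
        list.put("x-ray", 50);   // made-up frequencies
        list.put("post-op", 3);
        HyphenCutoffSketch sketch = new HyphenCutoffSketch(list, 10);
        System.out.println("x-ray kept whole: " + sketch.isSingleToken("x-ray"));
        System.out.println("post-op kept whole: " + sketch.isSingleToken("post-op"));
    }
}
```

+<p>A word that is absent from the list, or listed at or below the cutoff, would be split at the hyphen by the surrounding tokenizer logic, which this sketch does not show.</p>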
+<h2 id="tools-training-a-sentence-detector-model">Tools: Training a sentence detector model</h2>
+<p>To train a sentence detector that recognizes the same set of candidate
+end-of-sentence characters that the <a href="http://ohnlp.sourceforge.net/cTAKES/#sentdetect_annot">SentenceDetectorAnnotator</a> uses:</p>
+<p><strong>java -cp &lt;classpath&gt; edu.mayo.bmi.uima.core.ae.SentenceDetector</strong> <strong><em>&lt;sents_file&gt;</em></strong> <strong><em>&lt;model&gt;</em></strong> <strong><em>&lt;iters&gt;</em></strong> <strong><em>&lt;cut&gt;</em></strong><br />
+Where</p>
+<ul>
+<li><em>&lt;sents_file&gt;</em> is your sentence training data file, one sentence per line; see Example 4.1, "Sentence detector training data file sample".</li>
+<li><em>&lt;model&gt;</em> is the name of the model file to be created.</li>
+<li><em>&lt;iters&gt;</em> (optional) is the number of iterations for training.</li>
+<li><em>&lt;cut&gt;</em> (optional) is the cutoff value.</li>
+</ul>
+<p><img alt="" src="/images/icons/emoticons/check.png" /></p>
+<p><strong>Tip</strong><br />
+Eclipse users may run the "SentenceDetector--train_a_new_model" launch.</p>
+<p><strong>Example 4.1. Sentence detector training data file sample</strong><br />
+One sentence per line.</p>
+<p>The boy ran.</p>
+<p>Did the girl run too?</p>
+<p>Yes, she did.</p>
+<p>Where did she go?</p>
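+<p>If the training sentences are produced programmatically, the one-sentence-per-line layout of Example 4.1 could be generated with a small helper such as the sketch below. The class and method names are hypothetical, and skipping blank entries and rejecting multi-line sentences are assumptions of this sketch rather than documented requirements of the training tool.</p>

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative only: builds the text of a training file, one sentence per line. */
public class TrainingTextSketch {

    public static String toTrainingText(List<String> sentences) {
        StringBuilder sb = new StringBuilder();
        for (String s : sentences) {
            String trimmed = s.trim();
            // A sentence must fit on one line; a line break inside it is an error.
            if (trimmed.indexOf('\n') >= 0 || trimmed.indexOf('\r') >= 0) {
                throw new IllegalArgumentException("Sentence spans several lines: " + s);
            }
            // Assumption of this sketch: blank entries are skipped.
            if (!trimmed.isEmpty()) {
                sb.append(trimmed).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(toTrainingText(Arrays.asList(
                "The boy ran.", "Did the girl run too?", "Yes, she did.", "Where did she go?")));
    }
}
```

+<p>The returned text can then be written to the file passed as <em>&lt;sents_file&gt;</em>.</p>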
+<h3 id="verify-you-can-train-a-sentence-detector-model-successfully">Verify you can train a sentence detector model successfully</h3>
+<p>The sample model resources/sentdetect/sample_sd_included.mod was trained from
+data/test/sample_sd_training_sentences.txt, using the default values for
+"iters" and "cut" (that is, without specifying them on the command line). To
+verify your setup, train a model from the same data and compare it with the
+sample one, using your favorite tool.</p>
+<h3 id="using-opennlp-directly-to-train-sentence-detector-model">Using OpenNLP directly
to train sentence detector model</h3>
+<p>You can train a sentence detector directly using the OpenNLP sentence detector
+(SentenceDetectorME) with the default set of candidate end-of-sentence
+characters, using:</p>
+<p>The four parameters have the same meaning as in the tool provided with this
+project; "infile" uses the same format as in Example 4.1, "Sentence detector
+training data file sample".</p>
+<h2 id="running-the-sentence-detector-and-tokenizer">Running the sentence detector
and tokenizer</h2>
+<p>This project provides a sentence detector CPE descriptor and a tokenizer CPE
+descriptor. To run the CPE:</p>
+<p>java -cp &lt;classpath&gt; org.apache.uima.tools.cpm.CpmFrame<br />
+Open</p>
+<p>desc/collection_processing_engine/SentenceDetecorCPE.xml to run a sentence detector; or<br />
+desc/collection_processing_engine/SentencesAndTokensCPE.xml to run a tokenizer</p>
+<p>The sentence detector CPE uses the analysis engines listed in
+desc/analysis_engine/SentenceDetectorAggregate.xml, and the tokenizer CPE uses
+those listed in desc/analysis_engine/SentencesAndTokensAggregate.xml. The two
+CPEs are defined to read from plain text file(s) in
+data/test/sample_notes_plaintext using the FilesInDirectoryCollectionReader.</p>
+<p><strong>Tip</strong><br />
+Eclipse users may use the "SentenceDetector_annotator" and the "Tokenizer
+annotator" launches.</p>
+<h2 id="how-do-the-cpes-work">How do the CPEs work?</h2>
+<p>Since the sentence annotator processes the text one section at a time, there
+must be at least one section (segment) annotation for the
+SentenceDetectorAnnotator to add Sentence annotations. Therefore the first
+analysis engine is the SimpleSegmentAnnotator, which creates a single Segment
+annotation that covers the entire text. Then the SentenceDetectorAnnotator
+analysis engine adds Sentence annotations. Then if you're running the
+tokenizer, the TokenizerAnnotator analysis engine adds annotations for tokens,
+such as PunctuationToken, WordToken, NewlineToken.</p>
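+<p>The order of the three steps can be sketched in plain Java. Nothing here is UIMA or cTAKES code: the Span class, the run method, and the crude character-class tokenization are hypothetical stand-ins, and the sentence step applies only the end-of-line rule instead of a trained model.</p>

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: plain-Java stand-ins, not UIMA or cTAKES classes. */
public class PipelineOrderSketch {

    /** Hypothetical stand-in for a typed annotation with offsets. */
    public static final class Span {
        public final String type;
        public final int begin;
        public final int end;

        Span(String type, int begin, int end) {
            this.type = type;
            this.begin = begin;
            this.end = end;
        }
    }

    public static List<Span> run(String text) {
        List<Span> out = new ArrayList<>();
        // 1. One Segment covering the entire text.
        Span segment = new Span("Segment", 0, text.length());
        out.add(segment);
        // 2. Sentences, looked for inside the segment (end-of-line rule only).
        int begin = segment.begin;
        for (int i = segment.begin; i <= segment.end; i++) {
            if (i == segment.end || text.charAt(i) == '\n') {
                if (i > begin) {
                    out.add(new Span("Sentence", begin, i));
                }
                begin = i + 1;
            }
        }
        // 3. Tokens: a crude split, far simpler than Penn Treebank rules.
        int pos = segment.begin;
        while (pos < segment.end) {
            char c = text.charAt(pos);
            if (c == '\n') {
                out.add(new Span("NewlineToken", pos, pos + 1));
                pos++;
            } else if (Character.isLetterOrDigit(c)) {
                int j = pos;
                while (j < segment.end && Character.isLetterOrDigit(text.charAt(j))) {
                    j++;
                }
                out.add(new Span("WordToken", pos, j));
                pos = j;
            } else if (!Character.isWhitespace(c)) {
                out.add(new Span("PunctuationToken", pos, pos + 1));
                pos++;
            } else {
                pos++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (Span s : run("Yes, she did.\nOK")) {
            System.out.println(s.type + " " + s.begin + "-" + s.end);
        }
    }
}
```

+<p>In this sketch step 3 reads only the segment offsets, never the Sentence spans, so it would work equally well if step 2 were removed.</p>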
+<p>Strictly speaking, it would not be necessary to run the
+SentenceDetectorAnnotator in order to test the TokenizerAnnotator. The
+TokenizerAnnotator does not require the presence of Sentence annotations.</p>
+  </div>
+ 
+ <div id="footera">
+    <div id="copyrighta">
+      <p>Copyright &#169; 2011 The Apache Software Foundation, Licensed under the
<a href="http://www.apache.org/licenses/LICENSE-2.0">Apache License, Version 2.0</a>.<br/>Apache
and the Apache feather logo are trademarks of The Apache Software Foundation.</p>
+    </div>
+ </div>
+ 
+</body>
+</html>
+


