ctakes-dev mailing list archives

From "Mattmann, Chris A (3980)" <chris.a.mattm...@jpl.nasa.gov>
Subject Re: How to update cTAKES so that new top level categories come out based on local dictionary?
Date Thu, 08 Oct 2015 02:39:16 GMT
Thank you Sean.

This has been tremendously helpful. One last question:

How would I add a new top-level category alongside the existing ones,
e.g., MedicationMention, ProcedureMention, etc.?

For example, I see the CID and BSV files etc. as the “customvalue”
within, e.g.,

MedicationMention
  > customvalue
  > customvalue2

..

I would like to now add:

NewCategory
  > customvalue
  > customvalue2

How would I add the NewCategory group?

Thank you so much. FWIW, this is for the Shangridocs project
that we are working on:

http://github.com/chrismattmann/shangridocs/

It combines Tika, cTAKES, Solr and Wicket, all from Apache, to make
an interactive NER application that extracts knowledge from PDFs/medical
papers and allows it to be combined and searched with private medical
data indexed in Solr.

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++





-----Original Message-----
From: "Finan, Sean" <Sean.Finan@childrens.harvard.edu>
Reply-To: "dev@ctakes.apache.org" <dev@ctakes.apache.org>
Date: Tuesday, October 6, 2015 at 2:04 PM
To: "dev@ctakes.apache.org" <dev@ctakes.apache.org>
Subject: RE: How to update cTAKES so that new top level categories come
out based on local dictionary?

>Hi Chris,
>
>I use bsv to denote "bar separated value" - also known as "pipe
>delimited".  I typically name the files with a ".bsv" extension, and they
>are just plain old boring ascii flat files.
>There should be multiple columns in the bsv file separated by the '|'
>character.  The following are all valid per-line formats:
>CUI|text
>CUI|TUI|text
>CUI|TUI|text|preferredText
>It doesn't matter which format you choose, the parser will auto-detect
>per-line.  Starting a line with "//" or "#" indicates that it is a
>comment and should be ignored.
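
A minimal sketch of the per-line auto-detection described above, written in Python purely for illustration. This is not the cTAKES parser (which is Java); the names `BsvEntry` and `parse_bsv_line` are hypothetical, and the choice to reject lines with any other column count is an assumption.

```python
from typing import NamedTuple, Optional


class BsvEntry(NamedTuple):
    """One dictionary row; tui and preferred_text are optional columns."""
    cui: str
    tui: Optional[str]
    text: str
    preferred_text: Optional[str]


def parse_bsv_line(line: str) -> Optional[BsvEntry]:
    """Parse one pipe-delimited line, or return None for blanks and comments."""
    stripped = line.strip()
    # Lines starting with "//" or "#" are comments and are ignored.
    if not stripped or stripped.startswith("//") or stripped.startswith("#"):
        return None
    columns = [column.strip() for column in stripped.split("|")]
    if len(columns) == 2:    # CUI|text
        cui, text = columns
        return BsvEntry(cui, None, text, None)
    if len(columns) == 3:    # CUI|TUI|text
        cui, tui, text = columns
        return BsvEntry(cui, tui, text, None)
    if len(columns) == 4:    # CUI|TUI|text|preferredText
        return BsvEntry(*columns)
    # Assumption: anything else is treated as malformed in this sketch.
    raise ValueError(f"expected 2 to 4 columns, got {len(columns)}: {line!r}")
```

Because the format is detected per line, a single file may mix two-, three- and four-column rows.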
>
>
>To add the bsv dictionary to your pipeline you just need to edit the
>resources/org/apache/ctakes/dictionary/lookup/fast/cTakesHsql.xml file
>and add a couple new sections.
>Within the <dictionaries> section, add:
>      <dictionary>
>         <name>CustomCuiRareWord</name>
>         <implementationName>org.apache.ctakes.dictionary.lookup2.dictionary.BsvRareWordDictionary</implementationName>
>         <properties>
>            <property key="bsvPath" value="org/apache/ctakes/dictionary/fast/example/custom_cui_tui_bsv.bsv"/>
>         </properties>
>      </dictionary>
>Within the <conceptFactories> section, add:
>      <conceptFactory>
>         <name>CustomCuiConcept</name>
>         <implementationName>org.apache.ctakes.dictionary.lookup2.concept.BsvConceptFactory</implementationName>
>         <properties>
>            <property key="bsvPath" value="org/apache/ctakes/dictionary/fast/example/custom_cui_tui_bsv.bsv"/>
>         </properties>
>      </conceptFactory>
>Within the <dictionaryConceptPairs> section, add:
>      <dictionaryConceptPair>
>         <name>CustomPair</name>
>         <dictionaryName>CustomCuiRareWord</dictionaryName>
>         <conceptFactoryName>CustomCuiConcept</conceptFactoryName>
>      </dictionaryConceptPair>
>You can change all of the [Custom**] names, and you should obviously
>point to the actual path of your bsv file.
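
Since the three sections refer to each other by name, a renamed entry can easily leave a dangling reference. Here is a small Python sketch that checks every dictionaryConceptPair points at a dictionary and a concept factory that actually exist. The function name `find_unresolved_pairs` is hypothetical, and the `<lookupSpecification>` root element in the sample is an assumption about the file layout; the check itself does not depend on the root name.

```python
import xml.etree.ElementTree as ET
from typing import List

# Trimmed-down sample; the root element name is an assumption.
SAMPLE = """
<lookupSpecification>
  <dictionaries>
    <dictionary>
      <name>CustomCuiRareWord</name>
    </dictionary>
  </dictionaries>
  <conceptFactories>
    <conceptFactory>
      <name>CustomCuiConcept</name>
    </conceptFactory>
  </conceptFactories>
  <dictionaryConceptPairs>
    <dictionaryConceptPair>
      <name>CustomPair</name>
      <dictionaryName>CustomCuiRareWord</dictionaryName>
      <conceptFactoryName>CustomCuiConcept</conceptFactoryName>
    </dictionaryConceptPair>
  </dictionaryConceptPairs>
</lookupSpecification>
"""


def find_unresolved_pairs(xml_text: str) -> List[str]:
    """Return one message per pair reference that names nothing defined."""
    root = ET.fromstring(xml_text)
    dictionary_names = {d.findtext("name") for d in root.iter("dictionary")}
    factory_names = {f.findtext("name") for f in root.iter("conceptFactory")}
    problems = []
    for pair in root.iter("dictionaryConceptPair"):
        pair_name = pair.findtext("name")
        dictionary_ref = pair.findtext("dictionaryName")
        factory_ref = pair.findtext("conceptFactoryName")
        if dictionary_ref not in dictionary_names:
            problems.append(f"{pair_name}: unknown dictionary {dictionary_ref!r}")
        if factory_ref not in factory_names:
            problems.append(f"{pair_name}: unknown concept factory {factory_ref!r}")
    return problems
```

Running `find_unresolved_pairs` on the edited file's contents before starting the pipeline should return an empty list if the names line up.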
>
>In addition to detecting your column count/style, upon loading the text
>will be lower-cased and tokenized and the terms will be indexed by rare
>word (for fast lookup).   Also, you do not need to write out the whole
>"C1234567" or "T123" cui tui codes.  The default prefix characters and
>padding zeros are automatically added.   Cuis "1" "01" "C1" "C01" will
>all be stored as "C0000001" and Tuis are handled likewise.  If you have
>custom cuis then it will honor non-"C" prefixes and still pad zeros
>automatically based upon the longest entry.  For instance, if your bsv
>has "CAM1", "CAM12" and "CAM12345" then the stored custom cuis should be
>"CAM00001", "CAM00012" and "CAM12345".
>
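
To make the padding rules above concrete, here is a rough Python sketch. It only mimics the behaviour as described in this message and is not taken from the cTAKES source; `normalize_codes` is a hypothetical name, and the default width of seven digits for cuis (three for tuis) is inferred from the "C1234567" and "T123" examples.

```python
import re
from typing import Dict, List, Tuple

# Optional alphabetic prefix followed by one or more digits.
_CODE = re.compile(r"^([A-Za-z]*)(\d+)$")


def normalize_codes(raw_codes: List[str],
                    default_prefix: str = "C",
                    default_digits: int = 7) -> List[str]:
    """Add the default prefix where missing and left-pad digits with zeros."""
    parsed: List[Tuple[str, str]] = []
    for raw in raw_codes:
        match = _CODE.match(raw.strip())
        if match is None:
            raise ValueError(f"unrecognised code: {raw!r}")
        prefix = match.group(1) or default_prefix
        parsed.append((prefix, match.group(2)))

    # For each prefix, pad to the longest digit run seen with that prefix.
    widths: Dict[str, int] = {}
    for prefix, digits in parsed:
        widths[prefix] = max(widths.get(prefix, 0), len(digits))
    # The default prefix always pads to at least the standard width.
    widths[default_prefix] = max(widths.get(default_prefix, 0), default_digits)

    return [prefix + digits.zfill(widths[prefix]) for prefix, digits in parsed]
```

With this sketch, `normalize_codes(["1", "01", "C1", "C01"])` gives four copies of "C0000001", and `normalize_codes(["CAM1", "CAM12", "CAM12345"])` gives "CAM00001", "CAM00012" and "CAM12345". Tuis could be handled with `default_prefix="T", default_digits=3`.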
>I think that is about all that there is to it ...
>
>Sean
>
>-----Original Message-----
>From: Mattmann, Chris A (3980) [mailto:chris.a.mattmann@jpl.nasa.gov]
>Sent: Tuesday, October 06, 2015 4:31 PM
>To: dev@ctakes.apache.org
>Subject: Re: How to update cTAKES so that new top level categories come
>out based on local dictionary?
>
>Hi Sean,
>
>Thanks so much for your reply. For now I don’t care about the secondary
>codes and I for sure have < 1000 terms. Can you tell me how to wire up
>the BSV file by editing specific places in cTAKES? What specific commands
>should I run, and what format should the BSV file have? I must admit
>I have never heard of BSV files, and Internet search results for the term
>vary between Bluespec System Verilog and BASIC bsave files.
>
>Then after I make the BSV file, what are the next steps? Recompile cTAKES?
>Can I take the BSV file and simply point to it from a binary installation
>of cTAKES? Thank you!
>
>Cheers,
>Chris
>
>-----Original Message-----
>From: "Finan, Sean" <Sean.Finan@childrens.harvard.edu>
>Reply-To: "dev@ctakes.apache.org" <dev@ctakes.apache.org>
>Date: Tuesday, October 6, 2015 at 8:05 AM
>To: "dev@ctakes.apache.org" <dev@ctakes.apache.org>
>Subject: RE: How to update cTAKES so that new top level categories come
>out based on local dictionary?
>
>>Hi Chris,
>>
>>There are a few ways to do this:
>>1.  Create an additional dictionary with the terms of interest and add it
>>as a source
>>2.  Create a new dictionary hsqldb that contains everything, old and new
>>3.  Add to the existing hsqldb dictionary
>>
>>The best approach for you would probably depend upon
>>1.  How many new terms you have
>>2.  Whether or not you desire additional codes, i.e. rxnorm, snomed
>>
>>If you don't have many new terms (<1000) and you don't care about
>>secondary codes then the easiest thing would be to create a BSV file with
>>the new terms and cuis.
>>
>>If you have a lot of new terms or do care about secondary codes, then a
>>less facile solution would be to create a new hsqldb with only the new
>>info or a complete replacement with new and old/existing terms.  Of the
>>two hsql options creating a new all-inclusive database would probably be
>>easier unless you want to learn the ins and outs of hsql.  If all of the
>>terms are in the umls, then the new all-inclusive hsqldb would definitely
>>be easiest (I think) as you could use the dictionary tool to create it.
>>
>>If you let me know your exact situation then I may be able to better
>>expound.
>>
>>Sean
>>
>>-----Original Message-----
>>From: Mattmann, Chris A (3980) [mailto:chris.a.mattmann@jpl.nasa.gov]
>>Sent: Monday, October 05, 2015 7:36 PM
>>To: dev@ctakes.apache.org
>>Subject: How to update cTAKES so that new top level categories come out
>>based on local dictionary?
>>
>>Hi cTAKES team,
>>
>>Hope you’re well! I had a quick question. I was wondering if someone
>>could provide me a step-by-step guide to updating cTAKES to be based
>>off a local dictionary, so that in addition to e.g.,
>>
>>ProcedureMention
>>  Value1 position etc
>>  Value2 position etc
>>
>>MedicationMention
>>  Value1 position etc
>>  Value2 position etc
>>
>>NewTopLevelCategoryFromMyDictionary
>>  FoundValue1 position etc
>>  FoundValue2 position etc
>>
>>I realize this has something to do with updating the annotation
>>descriptions etc in XML, so if someone could just tell me what
>>to update I’d really appreciate it.
>>
>>Thank you!
>>
>>Cheers,
>>Chris
