ctakes-dev mailing list archives

From "Finan, Sean" <Sean.Fi...@childrens.harvard.edu>
Subject RE: How to update cTAKES so that new top level categories come out based on local dictionary?
Date Thu, 08 Oct 2015 13:51:35 GMT
Hi Chris, to answer this question I'll direct you to the code in
org.apache.ctakes.dictionary.lookup2.consumer.DefaultTermConsumer#createSemanticAnnotation(...)

It takes an int and returns an IdentifiedAnnotation.  You can create a new (type system)
type and then make a custom TermConsumer that can create it in a fashion similar to the workings
of DefaultTermConsumer.

However, you would want to create the IdentifiedAnnotation based upon something in addition
to the passed int.  You can pass along a cui or tui (better) and return the new IdentifiedAnnotation
based upon that if the tui is one of yours - otherwise use the normal returns.

This would all be in a new implementation of TermConsumer.consumeTypeIdHits(...)
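
A minimal sketch of that selection logic, in plain Java so it runs on its own. It does not use the cTAKES API: strings stand in for annotation classes, the int values and the "T900"/"T901" TUIs are placeholders, and "NewCategoryMention" is a hypothetical custom type. In a real consumer the same branch would sit where the IdentifiedAnnotation is created.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Plain-Java stand-in: strings take the place of cTAKES annotation classes.
final class CustomTypeSketch {
    // Hypothetical TUIs reserved for the local dictionary (assumption).
    static final Set<String> CUSTOM_TUIS =
            new HashSet<>(Arrays.asList("T900", "T901"));

    // Stand-in for the stock int-to-type mapping; the int values are placeholders.
    static String defaultTypeFor(int semanticGroup) {
        switch (semanticGroup) {
            case 1: return "MedicationMention";
            case 2: return "ProcedureMention";
            default: return "EntityMention";
        }
    }

    // Use the custom type when the TUI is one of ours, otherwise the normal return.
    static String typeFor(int semanticGroup, String tui) {
        if (tui != null && CUSTOM_TUIS.contains(tui)) {
            return "NewCategoryMention";
        }
        return defaultTypeFor(semanticGroup);
    }

    public static void main(String[] args) {
        System.out.println(typeFor(1, "T900")); // custom TUI
        System.out.println(typeFor(1, "T121")); // falls back to the default mapping
    }
}
```

Keying the decision on the TUI rather than on individual CUIs keeps the set of special cases small, since every term in the local dictionary can share one custom TUI.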

So, the solution isn't without code change, but thems the breaks.  There has been a little
bit of discussion on adding types for standard [umls] semantic groups that aren't within the
current ctakes set ... please weigh in if you feel the need.

Sean
-----Original Message-----
From: Mattmann, Chris A (3980) [mailto:chris.a.mattmann@jpl.nasa.gov] 
Sent: Wednesday, October 07, 2015 10:39 PM
To: dev@ctakes.apache.org
Subject: Re: How to update cTAKES so that new top level categories come out based on local
dictionary?

Thank you Sean.

This has been tremendously helpful. One last question:

How would I add the top level category or a new one besides,
e.g., MedicalMention, or ProceduralMention, etc.

For example I see the CID, and BSV files etc as the “customvalue”
within e.g.,

MedicalMention
  > customvalue
  > customvalue2
..

I would like to now add:

NewCategory
  > customvalue
  > customvalue2

How would I add the NewCategory group?

Thank you so much. FWIW, this is for the Shangridocs project
that we are working on:

http://github.com/chrismattmann/shangridocs/

it combines Tika, cTAKES, Solr and Wicket all from Apache to make
an interactive NER application to extract knowledge from PDFs/medical
papers and to allow it to be combined and searched with private medical
data indexed in Solr.

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: "Finan, Sean" <Sean.Finan@childrens.harvard.edu>
Reply-To: "dev@ctakes.apache.org" <dev@ctakes.apache.org>
Date: Tuesday, October 6, 2015 at 2:04 PM
To: "dev@ctakes.apache.org" <dev@ctakes.apache.org>
Subject: RE: How to update cTAKES so that new top level categories come
out based on local dictionary?

>Hi Chris,
>
>I use bsv to denote "bar separated value" - also known as "pipe
>delimited".  I typically name the files with a ".bsv" extension, and they
>are just plain old boring ascii flat files.
>There should be multiple columns in the bsv file separated by the '|'
>character.  The following are all valid per-line formats:
>CUI|text
>CUI|TUI|text
>CUI|TUI|text|preferredText
>It doesn't matter which format you choose, the parser will auto-detect
>per-line.  Starting a line with "//" or "#" indicates that it is a
>comment and should be ignored.
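
To make the three line formats concrete, here is a small stand-alone sketch of per-line detection by column count. It is an illustration only, not the cTAKES parser: the class and field names are hypothetical, and skipping blank lines is an assumption.

```java
// Illustrative per-line BSV reader; not the cTAKES implementation.
final class BsvLineSketch {
    // One parsed row; tui and preferredText are null when the line omits them.
    static final class Row {
        final String cui;
        final String tui;
        final String text;
        final String preferredText;

        Row(String cui, String tui, String text, String preferredText) {
            this.cui = cui;
            this.tui = tui;
            this.text = text;
            this.preferredText = preferredText;
        }
    }

    // Returns null for comment lines and (by assumption) blank lines.
    static Row parseLine(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty() || trimmed.startsWith("//") || trimmed.startsWith("#")) {
            return null;
        }
        String[] cols = trimmed.split("\\|");
        switch (cols.length) {
            case 2: return new Row(cols[0], null, cols[1], null);       // CUI|text
            case 3: return new Row(cols[0], cols[1], cols[2], null);    // CUI|TUI|text
            case 4: return new Row(cols[0], cols[1], cols[2], cols[3]); // CUI|TUI|text|preferredText
            default:
                throw new IllegalArgumentException("Unexpected column count: " + line);
        }
    }

    public static void main(String[] args) {
        Row row = parseLine("C0000001|T047|heart attack|Myocardial Infarction");
        System.out.println(row.cui + " " + row.tui + " " + row.text + " " + row.preferredText);
    }
}
```

Because the format is decided per line, a single file can mix two-, three- and four-column rows.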

>
>To add the bsv dictionary to your pipeline you just need to edit the
>resources/org/apache/ctakes/dictionary/lookup/fast/cTakesHsql.xml file
>and add a couple new sections.
>Within the <dictionaries> section, add:
>      <dictionary>
>         <name>CustomCuiRareWord</name>
>         <implementationName>org.apache.ctakes.dictionary.lookup2.dictionary.BsvRareWordDictionary</implementationName>
>         <properties>
>            <property key="bsvPath" value="org/apache/ctakes/dictionary/fast/example/custom_cui_tui_bsv.bsv"/>
>         </properties>
>      </dictionary>
>Within the <conceptFactories> section, add:
>      <conceptFactory>
>         <name>CustomCuiConcept</name>
>         <implementationName>org.apache.ctakes.dictionary.lookup2.concept.BsvConceptFactory</implementationName>
>         <properties>
>            <property key="bsvPath" value="org/apache/ctakes/dictionary/fast/example/custom_cui_tui_bsv.bsv"/>
>         </properties>
>      </conceptFactory>
>Within the <dictionaryConceptPairs> section, add:
>      <dictionaryConceptPair>
>         <name>CustomPair</name>
>         <dictionaryName>CustomCuiRareWord</dictionaryName>
>         <conceptFactoryName>CustomCuiConcept</conceptFactoryName>
>      </dictionaryConceptPair>
>You can change all of the [Custom**] names, and you should obviously
>point to the actual path of your bsv file.

>
>In addition to detecting your column count/style, upon loading the text
>will be lower-cased and tokenized and the terms will be indexed by rare
>word (for fast lookup).   Also, you do not need to write out the whole
>"C1234567" or "T123" cui tui codes.  The default prefix characters and
>padding zeros are automatically added.   Cuis "1" "01" "C1" "C01" will
>all be stored as "C0000001" and Tuis are handled likewise.  If you have
>custom cuis then it will honor non-"C" prefixes and still pad zeros
>automatically based upon the longest entry.  For instance, if your bsv
>has "CAM1", "CAM12" and "CAM12345" then the stored custom cuis should be
>"CAM00001", "CAM00012" and "CAM12345".
>
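
The padding rule can be sketched as follows. This is a stand-alone illustration that reproduces the examples given in the message, not the cTAKES code. The method name and the explicit totalLength parameter are assumptions: 8 matches "C0000001" and "CAM12345", 4 matches "T123".

```java
// Illustrative code normalizer; not the cTAKES implementation.
final class CodePaddingSketch {
    // Left-pads the numeric part with zeros so prefix + digits reach totalLength.
    // Codes without a leading non-digit prefix get defaultPrefix.
    static String normalize(String raw, String defaultPrefix, int totalLength) {
        String code = raw.trim();
        int firstDigit = 0;
        while (firstDigit < code.length() && !Character.isDigit(code.charAt(firstDigit))) {
            firstDigit++;
        }
        String prefix = firstDigit == 0 ? defaultPrefix : code.substring(0, firstDigit);
        String digits = code.substring(firstDigit);
        int width = totalLength - prefix.length();
        StringBuilder padded = new StringBuilder(prefix);
        for (int i = digits.length(); i < width; i++) {
            padded.append('0');
        }
        return padded.append(digits).toString();
    }

    public static void main(String[] args) {
        System.out.println(normalize("01", "C", 8));   // standard cui with default prefix
        System.out.println(normalize("CAM12", "C", 8)); // custom prefix is kept
        System.out.println(normalize("23", "T", 4));    // tui handled likewise
    }
}
```

A code whose digits already fill the width is returned unchanged, which is why "CAM12345" stays as written.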

>I think that is about all that there is to it ...
>
>Sean
>
>-----Original Message-----
>From: Mattmann, Chris A (3980) [mailto:chris.a.mattmann@jpl.nasa.gov]
>Sent: Tuesday, October 06, 2015 4:31 PM
>To: dev@ctakes.apache.org
>Subject: Re: How to update cTAKES so that new top level categories come
>out based on local dictionary?
>
>Hi Sean,
>
>Thanks so much for your reply. For now I don’t care about the secondary
>codes and I for sure have < 1000 terms. Can you tell me how to wire up
>the BSV file by editing specific places in cTAKES? What specific commands
>should I run or what format should the BSV file look like? I must admit
>I have never heard of BSV files and the Internet varies on these between
>Bluespec System Verilog and BASIC bsave files.
>
>Then after I make the BSV file, what steps next? Recompile cTAKES? Can
>I take the BSV file and simply point to it from a binary installation of
>cTAKES? Thank you!
>
>Cheers,
>Chris

>-----Original Message-----
>From: "Finan, Sean" <Sean.Finan@childrens.harvard.edu>
>Reply-To: "dev@ctakes.apache.org" <dev@ctakes.apache.org>
>Date: Tuesday, October 6, 2015 at 8:05 AM
>To: "dev@ctakes.apache.org" <dev@ctakes.apache.org>
>Subject: RE: How to update cTAKES so that new top level categories come
>out based on local dictionary?
>
>>Hi Chris,
>>
>>There are a few ways to do this:
>>1.  Create an additional dictionary with the terms of interest and add it
>>as a source
>>2.  Create a new dictionary hsqldb that contains everything, old and new
>>3.  Add to the existing hsqldb dictionary
>>
>>The best approach for you would probably depend upon
>>1.  How many new terms you have
>>2.  Whether or not you desire additional codes, i.e. rxnorm, snomed
>>
>>If you don't have many new terms (<1000) and you don't care about
>>secondary codes then the easiest thing would be to create a BSV file with
>>the new terms and cuis.
>>
>>If you have a lot of new terms or do care about secondary codes, then a
>>less facile solution would be to create a new hsqldb with only the new
>>info or a complete replacement with new and old/existing terms.  Of the
>>two hsql options creating a new all-inclusive database would probably be
>>easier unless you want to learn the ins and outs of hsql.  If all of the
>>terms are in the umls, then the new all-inclusive hsqldb would definitely
>>be easiest (I think) as you could use the dictionary tool to create it.
>>
>>If you let me know your exact situation then I may be able to better
>>expound.
>>
>>Sean

>>
>>-----Original Message-----
>>From: Mattmann, Chris A (3980) [mailto:chris.a.mattmann@jpl.nasa.gov]
>>Sent: Monday, October 05, 2015 7:36 PM
>>To: dev@ctakes.apache.org
>>Subject: How to update cTAKES so that new top level categories come out
>>based on local dictionary?
>>
>>Hi cTAKES team,
>>
>>Hope you’re well! I had a quick question. I was wondering if someone
>>could provide me a step-by-step guide to updating cTAKES to be based
>>off a local dictionary, so that in addition to e.g.,
>>
>>ProceduralMention
>>  Value1 position etc
>>  Value2 position etc
>>
>>MedicationMention
>>  Value1 position etc
>>  Value2 position etc
>>
>>NewTopLevelCategoryFromMyDictionary
>>  FoundValue1 position etc
>>  FoundValue2 position etc
>>
>>I realize this has something to do with updating the annotation
>>descriptions etc in XML, so if someone could just tell me what
>>to update I’d really appreciate it.
>>
>>Thank you!
>>
>>Cheers,
>>Chris
