ctakes-dev mailing list archives

From "Finan, Sean" <Sean.Fi...@childrens.harvard.edu>
Subject RE: periods and the interaction with PTB & Fast Dict Lookup.
Date Wed, 15 Jul 2015 12:57:42 GMT
Hi Britt,

The dictionary should be using ptb tokenization, but I obviously missed a rule and separated
the . from the following 2 in the dictionary.

I will double-check everything.


p.s. if you don’t mind my asking, are you looking into all connective tissue disorders or
just Shprintzen?

From: britt fitch [mailto:britt.fitch@wiredinformatics.com]
Sent: Tuesday, July 14, 2015 3:58 PM
To: dev@ctakes.apache.org
Subject: periods and the interaction with PTB & Fast Dict Lookup.

Another question/topic likely for Sean & Tim. Happy to get others’ feedback as well.

I am trying to identify gene-related information.

It appears that the PTB tokenization logic in places like the tokenizer & dictionary building
will split a string into multiple tokens if it is not a number and contains a period.

For example, given “22q11.2 deletion syndrome”:

PTB tokenizer: [22q11, .2, deletion, syndrome]
POS for the above term: [CD, CD, NN, NN]
Chunks for the above term: [B-NP, I-NP, I-NP, I-NP]

The same string is split differently, as [22q11, ., 2, deletion, syndrome], in the new
dictionary module (RareWordTermMapCreator.getTokens).
When the _rareWordTermMap gets created, it uses the first token as the key: 22q11=[org.apache.ctakes.dictionary.lookup2.term.RareWordTerm@37917c4d]
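
To make the two splits concrete, here is a minimal Python sketch. It is not the cTAKES code: ptb_like_tokens, dictionary_like_tokens and build_term_map are hypothetical names, and the regular expressions only approximate the behaviour described above for this one example.

```python
import re

# Hypothetical approximations of the two behaviours described in this message.
# Neither function is the actual cTAKES implementation.

def ptb_like_tokens(text):
    """Keep plain numbers whole; otherwise split before a period that is
    followed by digits, so the period stays attached to the digits."""
    tokens = []
    for word in text.split():
        if re.fullmatch(r"\d+(\.\d+)?", word):
            tokens.append(word)  # e.g. "3.5" stays one token
        else:
            tokens.extend(re.findall(r"\.\d+|[^.]+|\.", word))
    return tokens


def dictionary_like_tokens(text):
    """Split every period into its own token."""
    tokens = []
    for word in text.split():
        tokens.extend(re.findall(r"[^.]+|\.", word))
    return tokens


def build_term_map(terms, tokenize):
    """Key each term by its first token, as described for _rareWordTermMap."""
    term_map = {}
    for term in terms:
        toks = tokenize(term)
        term_map.setdefault(toks[0], []).append(toks)
    return term_map


term = "22q11.2 deletion syndrome"
ptb_split = ptb_like_tokens(term)
dict_split = dictionary_like_tokens(term)
term_map = build_term_map([term], dictionary_like_tokens)
```

Both splits share the first token, so the map key is the same either way; the difference only shows up in the tokens that follow it.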

The period-split difference above (period alone vs period + number) might be irrelevant here
because for the input “22q11.2 deletion syndrome”, the lookup indices are [2,3].
The new lookup will ignore the incoming tokens “22q11” because it's tagged CD and “.2” because
it's a number.
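
A small Python sketch of that filtering, under the assumption that a token is skipped when its POS tag is in an excluded set or when it parses as a number. The names EXCLUDED_POS, is_number and lookup_indices are made up for illustration and are not cTAKES identifiers.

```python
# Assumption: CD is the only excluded tag that matters for this example.
EXCLUDED_POS = {"CD"}


def is_number(token):
    """True if the token parses as a number, e.g. '.2'."""
    try:
        float(token)
        return True
    except ValueError:
        return False


def lookup_indices(tokens, pos_tags, excluded_pos=EXCLUDED_POS):
    """Return the indices of tokens that are allowed to start a lookup."""
    return [
        i
        for i, (tok, pos) in enumerate(zip(tokens, pos_tags))
        if pos not in excluded_pos and not is_number(tok)
    ]


tokens = ["22q11", ".2", "deletion", "syndrome"]
pos_tags = ["CD", "CD", "NN", "NN"]
indices = lookup_indices(tokens, pos_tags)

# The term map is keyed by "22q11", which is never used as a lookup token.
term_map = {"22q11": [["22q11", ".", "2", "deletion", "syndrome"]]}
candidates = [tokens[i] for i in indices if tokens[i] in term_map]
```

Here indices comes out as [2, 3] and candidates is empty, which is why the term is never found.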

It looks like this concept cannot be identified unless CD is allowed as
a lookup token POS.
Even if this is allowed though, in the case of gene locations I think the PTB rules might
not be the best fit.
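
As a rough illustration of why the two splits may still matter once CD is allowed, here is a Python sketch. It assumes the lookup compares the text tokens against the stored term tokens one by one, which I have not confirmed in the code; matches_at is a hypothetical helper.

```python
def matches_at(text_tokens, start, term_tokens):
    """Assumed matching rule: the term matches only if its tokens equal the
    text tokens one for one, starting at the given index."""
    end = start + len(term_tokens)
    return text_tokens[start:end] == term_tokens


# Tokens from the text, split the PTB way.
text_tokens = ["22q11", ".2", "deletion", "syndrome"]

# The term as currently stored in the dictionary (period on its own).
dict_term = ["22q11", ".", "2", "deletion", "syndrome"]

# The same term if the dictionary used the PTB split.
aligned_term = ["22q11", ".2", "deletion", "syndrome"]

found_with_current_split = matches_at(text_tokens, 0, dict_term)
found_with_aligned_split = matches_at(text_tokens, 0, aligned_term)
```

Under that assumption the current dictionary split does not match even when “22q11” is allowed as a lookup token, while an aligned split does.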

Are there any thoughts/experiences regarding addressing the gene location mentions like this?
Should the Fast Dict tokenization logic match the PTB tokenizer logic to produce the same
tokens?

Let me know if I have misread any of these points. Since these items would likely cause
large changes, I would like to get some feedback before moving forward.



Britt Fitch
Wired Informatics
265 Franklin St Ste 1702
Boston, MA 02110
