ctakes-dev mailing list archives

From britt fitch <britt.fi...@wiredinformatics.com>
Subject periods and the interaction with PTB & Fast Dict Lookup.
Date Tue, 14 Jul 2015 19:56:52 GMT
Another question/topic likely for Sean & Tim. Happy to get others’ feedback as well.

I am trying to identify gene related information.

It appears that the PTB tokenization logic, in places like the tokenizer & dictionary building,
splits a string into multiple tokens if it contains a period and is not a number.

For example, given “22q11.2 deletion syndrome”:

PTB tokenizer: [22q11, .2, deletion, syndrome]
POS for the above term: [CD, CD, NN, NN]
Chunks for the above term: [B-NP, I-NP, I-NP, I-NP]

The same string is split differently, as [22q11, ., 2, deletion, syndrome], in the new
dictionary module (RareWordTermMapCreator.getTokens).
When the _rareWordTermMap gets created it uses the first token as the key:
22q11=[org.apache.ctakes.dictionary.lookup2.term.RareWordTerm@37917c4d]
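
To make the two splits concrete, here is a minimal sketch in Python. It is not cTAKES code: the
regular expressions and the function names (ptb_like_tokens, dictionary_like_tokens,
build_first_token_map) are assumptions chosen only to reproduce the behavior reported above for
this one example, and the map keyed on the first token follows the description in this message.

```python
import re


def ptb_like_tokens(text):
    """Mimic the reported PTB split: a period followed by digits stays with the digits."""
    # Alternatives, in order: a ".2"-style fragment, a run of letters/digits,
    # or any other single non-space symbol.
    return re.findall(r"\.\d+|[A-Za-z0-9]+|[^\sA-Za-z0-9]", text)


def dictionary_like_tokens(text):
    """Mimic the reported dictionary split: every period becomes its own token."""
    return re.findall(r"[A-Za-z0-9]+|[^\sA-Za-z0-9]", text)


def build_first_token_map(terms):
    """Index each term under its first token (hypothetical stand-in for the term map)."""
    index = {}
    for term in terms:
        tokens = dictionary_like_tokens(term)
        if tokens:
            index.setdefault(tokens[0], []).append(tokens)
    return index


if __name__ == "__main__":
    term = "22q11.2 deletion syndrome"
    print(ptb_like_tokens(term))         # ['22q11', '.2', 'deletion', 'syndrome']
    print(dictionary_like_tokens(term))  # ['22q11', '.', '2', 'deletion', 'syndrome']
    print(list(build_first_token_map([term])))  # ['22q11']
```

Both functions agree on the first token, so the map key is the same either way. They disagree
from the second token onward, which would matter for any token-by-token comparison of the
remaining tokens.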

The period-split difference above (period alone vs. period + number) might be irrelevant here,
because for the input “22q11.2 deletion syndrome” the lookup indices are [2,3].
The new lookup will ignore the incoming tokens “22q11” because it's a CD and “.2” because
it's a number.

It looks like this concept cannot be identified unless CD is allowed as
a lookup token POS.
Even if it is allowed, though, I think the PTB rules might not be the best fit
for gene locations.
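
The effect of excluding CD can be sketched as follows. Again this is illustrative Python rather
than cTAKES code: lookup_indices, find_term_starts, and the excluded-POS set are hypothetical
names, and the dictionary entry is stored with the PTB-style tokens so that only the POS filter
is being demonstrated, not the tokenization mismatch.

```python
def lookup_indices(pos_tags, excluded_pos):
    """Return the token positions that are allowed to start a dictionary lookup."""
    return [i for i, pos in enumerate(pos_tags) if pos not in excluded_pos]


def find_term_starts(tokens, pos_tags, first_token_map, excluded_pos):
    """Return start positions where a dictionary term matches the token sequence."""
    starts = []
    for i in lookup_indices(pos_tags, excluded_pos):
        # Only terms indexed under the token at an allowed position are ever tried.
        for term_tokens in first_token_map.get(tokens[i], []):
            if tokens[i:i + len(term_tokens)] == term_tokens:
                starts.append(i)
    return starts


if __name__ == "__main__":
    tokens = ["22q11", ".2", "deletion", "syndrome"]
    pos_tags = ["CD", "CD", "NN", "NN"]
    # Assumption: the term is indexed under its first token, as described in the message.
    first_token_map = {"22q11": [tokens]}

    print(lookup_indices(pos_tags, {"CD"}))                             # [2, 3]
    print(find_term_starts(tokens, pos_tags, first_token_map, {"CD"}))  # []
    print(find_term_starts(tokens, pos_tags, first_token_map, set()))   # [0]
```

With CD excluded, the only allowed start positions are 2 and 3, so the key “22q11” is never
probed and the term is never found. Allowing CD makes position 0 a candidate and the match
succeeds in this sketch.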

Are there any thoughts/experiences on handling gene location mentions like this?
Should the Fast Dict tokenization logic match the PTB tokenizer logic to produce the same
components?

Let me know if I have misread any of these points. Since these items would likely cause
large changes, I am looking for some feedback before moving forward.

Cheers,

Britt


Britt Fitch
Wired Informatics
265 Franklin St Ste 1702
Boston, MA 02110
http://wiredinformatics.com
Britt.Fitch@wiredinformatics.com

