ctakes-dev mailing list archives

From britt fitch <britt.fi...@wiredinformatics.com>
Subject Re: periods and the interaction with PTB & Fast Dict Lookup.
Date Fri, 17 Jul 2015 13:59:46 GMT
Hi Sean, do you want a ticket for the PTB update?

Cheers,

Britt



Britt Fitch
Wired Informatics
265 Franklin St Ste 1702
Boston, MA 02110
http://wiredinformatics.com
Britt.Fitch@wiredinformatics.com

> On Jul 15, 2015, at 9:07 AM, britt fitch <britt.fitch@wiredinformatics.com> wrote:
> 
> Thanks Sean.
> 
> The other part of the concern is whether it is reasonable and feasible to alter the tokenization rules for things like gene locations. I can work around this in a few ways, but if there are other cases where this might come up, it could be worth looking at a blanket change. Sadly I don’t have another example off the top of my head; maybe organism names? From a few queries for UMLS terms containing periods, the majority seem to be things you really would want to split on. Perhaps genes are just an edge case.
> 
> I was looking at gene locations overall, not any particular gene or disorder grouping. The term I mentioned was just meant as an example.
> 
> 
> 
>> On Jul 15, 2015, at 8:57 AM, Finan, Sean <Sean.Finan@childrens.harvard.edu> wrote:
>> 
>> Hi Britt,
>> 
>> The dictionary should be using PTB tokenization, but I obviously missed a rule and separated the “.” from the following “2” in the dictionary.
>> 
>> I will double-check everything.
>> 
>> Sean
>> 
>> P.S. If you don’t mind my asking, are you looking into all connective tissue disorders or just Shprintzen?
>> 
>> From: britt fitch [mailto:britt.fitch@wiredinformatics.com]
>> Sent: Tuesday, July 14, 2015 3:58 PM
>> To: dev@ctakes.apache.org
>> Subject: periods and the interaction with PTB & Fast Dict Lookup.
>> 
>> Another question/topic, likely for Sean & Tim. Happy to get others’ feedback as well.
>> 
>> I am trying to identify gene-related information.
>> 
>> It appears that the PTB tokenization logic, in places like the tokenizer and dictionary building, splits a string into multiple tokens if it contains a period and is not a number.
>> 
>> For example, given “22q11.2 deletion syndrome”:
>> 
>> PTB tokenizer: [22q11, .2, deletion, syndrome]
>> POS for the above term: [CD, CD, NN, NN]
>> Chunks for the above term: [B-NP, I-NP, I-NP, I-NP]
>> 
>> The same string is split differently, as [22q11, ., 2, deletion, syndrome], in the new dictionary module (RareWordTermMapCreator.getTokens).
>> When the _rareWordTermMap is created, it uses the first token as the key: 22q11=[org.apache.ctakes.dictionary.lookup2.term.RareWordTerm@37917c4d]
>> 
>> The period-split difference above (period alone vs. period plus number) might be irrelevant here, because for the input “22q11.2 deletion syndrome” the lookup indices are [2,3].
>> The new lookup will ignore the incoming tokens “22q11”, because it is tagged CD, and “.2”, because it is a number.
>> 
>> It looks like this concept cannot be identified unless CD is allowed as a lookup token POS.
>> Even if it is allowed, though, I think the PTB rules might not be the best fit for gene locations.
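
[Editor's sketch] A toy model of the lookup behaviour described above, assuming only what the message states: tokens tagged CD are skipped as lookup starting points, and the term map is keyed by the term's first token. The names `EXCLUDED_POS`, `lookup_indices`, `candidate_terms`, and the dictionary contents are hypothetical, not cTAKES code.

```python
# Tokens and POS tags as reported in the message for "22q11.2 deletion syndrome".
tokens = ["22q11", ".2", "deletion", "syndrome"]
pos_tags = ["CD", "CD", "NN", "NN"]

# Hypothetical exclusion set; the message only says CD tokens are ignored.
EXCLUDED_POS = frozenset({"CD"})

# Hypothetical term map keyed by the first token of the dictionary entry,
# with the entry stored in the dictionary-side split.
rare_word_map = {"22q11": ["22q11 . 2 deletion syndrome"]}

def lookup_indices(tags, excluded=EXCLUDED_POS):
    """Indices of tokens that are allowed to start a dictionary lookup."""
    return [i for i, tag in enumerate(tags) if tag not in excluded]

def candidate_terms(toks, indices, term_map):
    """Collect dictionary entries whose key equals an allowed token."""
    hits = []
    for i in indices:
        hits.extend(term_map.get(toks[i], []))
    return hits

indices = lookup_indices(pos_tags)
print(indices)                                          # [2, 3]
print(candidate_terms(tokens, indices, rare_word_map))  # []

# With CD allowed, the key "22q11" is reached and yields a candidate.
all_indices = lookup_indices(pos_tags, excluded=frozenset())
print(candidate_terms(tokens, all_indices, rare_word_map))
```

In this toy model, allowing CD only surfaces a candidate entry. Whether the full term would then match presumably still depends on the text and dictionary tokenizations agreeing on the tokens after 22q11.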
>> 
>> Are there any thoughts or experiences on handling gene location mentions like this?
>> Should the Fast Dict tokenization logic match the PTB tokenizer logic so that both produce the same components?
>> 
>> Let me know if I have misread any of these points. Since these changes would likely be large, I am looking for feedback before moving forward.
>> 
>> Cheers,
>> 
>> Britt
> 

