ctakes-dev mailing list archives

From britt fitch <britt.fi...@wiredinformatics.com>
Subject Re: dictionary-look-fast fails to handle alternative CUIs
Date Thu, 09 Jul 2015 19:19:53 GMT
Absolutely. I’ll create it now.

Thanks!



Britt Fitch
Wired Informatics
265 Franklin St Ste 1702
Boston, MA 02110
http://wiredinformatics.com
Britt.Fitch@wiredinformatics.com

> On Jul 9, 2015, at 3:12 PM, Finan, Sean <Sean.Finan@childrens.harvard.edu> wrote:
> 
> Hi Britt,
> 
> I’ve got some code and tests to check in.  Would you like to write the jira item?
> 
> From: britt fitch [mailto:britt.fitch@wiredinformatics.com]
> Sent: Thursday, July 09, 2015 8:55 AM
> To: dev@ctakes.apache.org
> Subject: Re: dictionary-look-fast fails to handle alternative CUIs
> 
> I don’t think that is too much of a constraint, at least initially, to have all CUI values a consistent length for a given prefix.
> 
> Thanks Sean, let me know if there is any part of this you’d like a hand with.
> 
> Cheers,
> 
> Britt
> 
> On Jul 8, 2015, at 7:16 PM, Finan, Sean <Sean.Finan@childrens.harvard.edu> wrote:
> 
> Hi Britt,
> 
> You’ve got it exactly.
> 
> I actually started working on this right before a meeting right before I left work right before I went to the store … but I’m now back to it and I’m going to move forward with the tiny bit that I’ve got.  I don’t think that it will take too long …
> 
> One reason that I like the “pair” idea is that something like “CN123456” won’t get converted to “CN0123456” by assuming that it is a seven digit numerical base. Likewise somebody could make a tiny dictionary with “SEAN01, SEAN02, SEAN03…” through 99.  Then their output would still be formatted as “SEAN01 .. SEAN99”.  They couldn’t mix in “SEAN1, SEAN2 …” though.  Is that too much of a constraint?  Hmmm.  Well, I’m going to push forward with this idea.
> 
> I’ll check in whatever I get done tonight.
> 
> Cheers,
> Sean
> 
> 
> From: britt fitch [mailto:britt.fitch@wiredinformatics.com]
> Sent: Wednesday, July 08, 2015 4:21 PM
> To: dev@ctakes.apache.org
> Subject: Re: dictionary-look-fast fails to handle alternative CUIs
> 
> Thanks for the details, Sean. I had assumed the conversion to Long was related to sort/search efficiency, but that makes sense.
> 
> I had been thinking of something similar: parsing out the non-numerals and converting them to 2-digit numeral values, i.e. “a” = 01, “z” = 26. Ultimately CN123456 would become 0314123456, but I don’t think it’s sophisticated enough to avoid issues with leading zeros. We could prepend a 9 to it to avoid losing digits and use something like:
> 
> if(length>8 && begins with 9)
>           discard 9
>           while (length > 8)
>                       convert first 2 numbers to a letter
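
The letter-to-digits idea above could be sketched roughly as follows. This is only an illustration, not code from cTAKES: the class name `LetterDigitCuiCodec` and its methods are hypothetical, and it assumes every CUI is exactly 8 characters long with any letters at the front, which is what the "while (length > 8)" test relies on.

```java
import java.util.Locale;

// Hypothetical sketch of the "letters become 2-digit numbers" encoding; not cTAKES code.
final class LetterDigitCuiCodec {
    // Assumption: every CUI is exactly 8 characters, letters first.
    private static final int CUI_LENGTH = 8;

    private LetterDigitCuiCodec() {
    }

    // "CN123456" -> 9 + 03 + 14 + 123456 -> 90314123456L
    static long encode(final String cui) {
        // The leading 9 is a sentinel so that leading zeros survive the conversion to long.
        final StringBuilder sb = new StringBuilder("9");
        for (char c : cui.toUpperCase(Locale.ROOT).toCharArray()) {
            if (Character.isDigit(c)) {
                sb.append(c);
            } else {
                sb.append(String.format("%02d", c - 'A' + 1)); // A = 01 ... Z = 26
            }
        }
        return Long.parseLong(sb.toString());
    }

    static String decode(final long code) {
        // Discard the sentinel 9, then turn leading digit pairs back into letters
        // until the rebuilt CUI is down to 8 characters.
        String digits = Long.toString(code).substring(1);
        final StringBuilder prefix = new StringBuilder();
        while (prefix.length() + digits.length() > CUI_LENGTH) {
            final int letterIndex = Integer.parseInt(digits.substring(0, 2));
            prefix.append((char) ('A' + letterIndex - 1));
            digits = digits.substring(2);
        }
        return prefix + digits;
    }

    public static void main(final String[] args) {
        final long code = encode("CN123456");
        System.out.println(code);          // 90314123456
        System.out.println(decode(code));  // CN123456
    }
}
```

Note that the decoding step only works because of the fixed 8-character length; a prefix scheme with variable lengths would need extra bookkeeping.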
> 
> I think your suggestion sounds good to me. To run the example through:
> 
> “NLM300” gets parsed to “NLM” + “300”
> Store Pair<Integer,String>(3, NLM) at Pair[0]
> Produce a Long of 0 * 10000000 + 300 = 300L
> Backtrack to the actual “CUI”: floor(300/10000000) = 0, the index into Pair[]
> 300L - 0 * 10000000 = 300L
> Pair[0] = NLM
> CUI = NLM + 300
> 
> In that case, do we need to store it as a Pair at all, or is just storing the prefix in a String[] sufficient?
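
A minimal sketch of the prefix-index walk-through above, under stated assumptions: the class `PrefixedCuiCodec` and its methods are hypothetical names, not the code that was checked in to cTAKES. It assumes each prefix gets a block of 10,000,000 codes, and it rejects a prefix whose digit count changes, which is one possible way to enforce the "consistent length per prefix" constraint discussed in this thread.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the prefix + digit-count scheme; not the actual CuiCodeUtil.
final class PrefixedCuiCodec {
    // Each prefix owns a block of 10,000,000 codes, leaving room for up to 7 digits.
    private static final long MULTIPLIER = 10_000_000L;

    // Parallel lists play the role of Pair<Integer,String>: entry i is (digitCount, prefix).
    private final List<String> prefixes = new ArrayList<>();
    private final List<Integer> digitCounts = new ArrayList<>();

    long encode(final String cui) {
        // Split the CUI into its leading non-digit prefix and its trailing digits.
        int split = 0;
        while (split < cui.length() && !Character.isDigit(cui.charAt(split))) {
            split++;
        }
        final String prefix = cui.substring(0, split);
        final String digits = cui.substring(split);
        if (digits.isEmpty() || digits.length() > 7) {
            throw new IllegalArgumentException("Expected 1 to 7 trailing digits: " + cui);
        }
        int index = prefixes.indexOf(prefix);
        if (index < 0) {
            // First time this prefix is seen: remember it and its digit count.
            prefixes.add(prefix);
            digitCounts.add(digits.length());
            index = prefixes.size() - 1;
        } else if (digitCounts.get(index) != digits.length()) {
            throw new IllegalArgumentException("Inconsistent digit count for prefix " + prefix);
        }
        return index * MULTIPLIER + Long.parseLong(digits);
    }

    String decode(final long code) {
        final int index = (int) (code / MULTIPLIER); // which prefix block
        final long number = code % MULTIPLIER;       // numeric part within the block
        // The stored digit count restores zero padding, e.g. 1 -> "01" for SEAN01.
        return prefixes.get(index) + String.format("%0" + digitCounts.get(index) + "d", number);
    }

    public static void main(final String[] args) {
        final PrefixedCuiCodec codec = new PrefixedCuiCodec();
        final long nlm = codec.encode("NLM300");   // index 0
        final long cn = codec.encode("CN123456");  // index 1
        System.out.println(nlm + " " + codec.decode(nlm)); // 300 NLM300
        System.out.println(cn + " " + codec.decode(cn));   // 10123456 CN123456
    }
}
```

In this sketch the digit count is what lets `decode` produce "SEAN01" rather than "SEAN1", which may be one reason to keep a pair instead of a bare String[] of prefixes.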
> 
> I’m happy to start working on this unless you have a preference for splitting it out into multiple tasks.
> 
> On Jul 8, 2015, at 2:54 PM, Finan, Sean <Sean.Finan@childrens.harvard.edu> wrote:
> 
> By the way, in case you are wondering why it does this … the UMLS database that we use has roughly half a million CUIs.  Storing CUIs in the various tables as longs takes up a lot less space than storing them as 8 character strings.
> 
> From: britt fitch [mailto:britt.fitch@wiredinformatics.com]
> Sent: Wednesday, July 08, 2015 2:23 PM
> To: dev@ctakes.apache.org
> Subject: dictionary-look-fast fails to handle alternative CUIs
> 
> This is largely directed to Sean but open to other feedback as well.
> 
> The current fast lookup using a BSV parses the first field as “C” and up to 7 numerals, padding with “0” as needed to reach that length when applicable [see CuiCodeUtil.getCuiCode(String)].
> 
> The CUI string is then substring’d from 1 to len and parsed as a Long.
> 
> This is producing issues with other related, but separate, ontologies (MedGen) where the bulk of concepts use UMLS CUIs but some additional concepts were created by the NCBI where no CUI previously existed.
> These MedGen-specific concepts are created with a prefix “CN” + 6 numerals, resulting in “N123456” failing to produce a Long.
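
The failure described above can be reproduced with a few lines of plain Java. This is a simplified stand-in, not the actual CuiCodeUtil source: `LegacyCuiParser` and its methods are hypothetical names, and the sketch only mirrors the "drop the first character, parse the rest as a Long" step.

```java
// Hypothetical stand-in for the described behaviour; not the actual CuiCodeUtil.
final class LegacyCuiParser {
    private LegacyCuiParser() {
    }

    // Drop the leading character and parse the remainder as a long.
    static long toCode(final String cui) {
        return Long.parseLong(cui.substring(1));
    }

    // True when the remainder after the first character is purely numeric.
    static boolean isParseable(final String cui) {
        try {
            toCode(cui);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(final String[] args) {
        System.out.println(toCode("C0001234"));       // 1234
        System.out.println(isParseable("CN123456"));  // false: "N123456" is not a number
    }
}
```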
> 
> I wanted Sean’s thoughts on this, and some feedback on whether others are running into this issue and whether the community wants a solution that supports a CUI format beyond the standard C + 7 numerals.
> 
> I’m happy to make these edits and check them in, whether that means updating the CuiCodeUtil class or creating an entirely new BSVConceptFactory, if that’s what makes the most sense.
> 
> Thoughts?