lucene-general mailing list archives

From yarongolan <yar...@xennexinc.com>
Subject Full-Text Search in a Relational Model
Date Thu, 24 Jan 2008 12:29:34 GMT

Hi,

(Warning, not for the weak-hearted)

I'm currently working on a project where we have a large and complex data
model, related to Genomics. We are trying to build a search engine that
provides "full text" and "field-based text" searches for our customer base
(mostly academic research), and are evaluating different tools for this
purpose.

As a starting point, take the following example set of objects (stored in
tables as a relational model):
Gene [ID, Symbol, Description]
Article - M:M with Gene [ID, Title]
Disease - M:M with Gene [ID, Name]
Author - M:M with Article [ID, Name]
(Note: the M:M tables exist; they just link IDs)

An example data set would be (shown hierarchically, with M:M relations
expanded as duplicates):

  Gene [ID=1, Symbol=EGFR, Description=epidermal growth factor receptor]
    Article [ID=1, Title=EGFR mutations in lung cancer: correlation with
clinical response to gefitinib therapy]
      Author [ID=1, Name=H. Michaelson]
      Author [ID=2, Name=J. Watson]
    Article [ID=2, Title=Proteomics analysis of epidermal protein kinases by
target class-selective prefractionation and tandem mass spectrometry]
      Author [ID=1, Name=H. Michaelson]
      Author [ID=3, Name=M. Roberts]
    Disease [ID=1, Name=Epidermal sluffing]

  Gene [ID=2, Symbol=AHCY, Description=S-adenosylhomocysteine hydrolase]
    Article [ID=3, Title=Limited proteolysis of S-adenosylhomocysteine
hydrolase: implications for the three-dimensional structure]
      Author [ID=4, Name=B. Cohen]
      Author [ID=5, Name=L. Alexander]
    Article [ID=2, Title=Proteomics analysis of epidermal protein kinases by
target class-selective prefractionation and tandem mass spectrometry]
      Author [ID=1, Name=H. Michaelson]
      Author [ID=3, Name=M. Roberts]

Note the IDs in the objects above, as they convey the relations in the
hierarchical model (for example, Article ID=2 appears under both genes).
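
To make the duplication concrete, here is a minimal Python sketch of how the
link tables could be expanded into the gene-centric hierarchy above. The data
mirrors the example; the variable names and the flatten() helper are
illustrative assumptions, not part of any existing system.

```python
# Hypothetical in-memory copies of the tables described above.
genes = {
    1: {"Symbol": "EGFR", "Description": "epidermal growth factor receptor"},
    2: {"Symbol": "AHCY", "Description": "S-adenosylhomocysteine hydrolase"},
}
articles = {
    1: "EGFR mutations in lung cancer: correlation with clinical response "
       "to gefitinib therapy",
    2: "Proteomics analysis of epidermal protein kinases by target "
       "class-selective prefractionation and tandem mass spectrometry",
    3: "Limited proteolysis of S-adenosylhomocysteine hydrolase: "
       "implications for the three-dimensional structure",
}
diseases = {1: "Epidermal sluffing"}
authors = {1: "H. Michaelson", 2: "J. Watson", 3: "M. Roberts",
           4: "B. Cohen", 5: "L. Alexander"}

# M:M link tables: pairs of IDs only.
gene_article = [(1, 1), (1, 2), (2, 3), (2, 2)]
gene_disease = [(1, 1)]
article_author = [(1, 1), (1, 2), (2, 1), (2, 3), (3, 4), (3, 5)]


def flatten(gene_id):
    """Build the nested view of one gene, duplicating shared articles/authors."""
    gene = genes[gene_id]
    return {
        "ID": gene_id,
        "Symbol": gene["Symbol"],
        "Description": gene["Description"],
        "Articles": [
            {
                "ID": article_id,
                "Title": articles[article_id],
                "Authors": [
                    {"ID": author_id, "Name": authors[author_id]}
                    for (linked_article, author_id) in article_author
                    if linked_article == article_id
                ],
            }
            for (linked_gene, article_id) in gene_article
            if linked_gene == gene_id
        ],
        "Diseases": [
            {"ID": disease_id, "Name": diseases[disease_id]}
            for (linked_gene, disease_id) in gene_disease
            if linked_gene == gene_id
        ],
    }


hierarchy = [flatten(gene_id) for gene_id in genes]
```

Each entry in hierarchy is one self-contained "gene document"; Article ID=2
and its authors are copied into both of them.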
      
In our full-text search, we would like to allow users to search ANY textual
field for any string, for instance the term "epidermal", and to display the
list of genes that have any associated data containing that term (ranked, of
course).
Our list of results would be something like:

EGFR
  Found in Description (epidermal growth factor receptor)
  Found in Article ID#2, in Title (proteomics analysis of epidermal protein
kinases by target class-selective prefractionation and tandem mass
spectrometry)
  Found in Disease ID#1, in Name (Epidermal sluffing)

AHCY
  Found in Article ID#2, in Title (proteomics analysis of epidermal protein
kinases by target class-selective prefractionation and tandem mass
spectrometry)

Note that the results retain a hierarchical view of our Genes (being
Gene-centric, we are essentially framing the question as "find this term in
any information related to those genes"). Also note that Article ID#2 has an
M:M relation with both Gene ID 2 (AHCY) and Gene ID 1 (EGFR); only because of
that is AHCY considered a gene that has "epidermal" in its annotations.
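
As a rough illustration of the lookup described above, here is a small Python
sketch that scans gene-centric rows and groups the hits per gene. It uses
naive case-insensitive substring matching rather than a real full-text index,
and the row layout and the search() function are assumptions made for this
example only.

```python
from collections import defaultdict

ARTICLE_2 = ("Proteomics analysis of epidermal protein kinases by target "
             "class-selective prefractionation and tandem mass spectrometry")

# One row per text value: (gene symbol, source object, field name, text).
# Article ID#2 is listed once per linked gene (the duplication noted above).
rows = [
    ("EGFR", "Description", None, "epidermal growth factor receptor"),
    ("EGFR", "Article ID#1", "Title",
     "EGFR mutations in lung cancer: correlation with clinical response "
     "to gefitinib therapy"),
    ("EGFR", "Article ID#2", "Title", ARTICLE_2),
    ("EGFR", "Author ID#1", "Name", "H. Michaelson"),
    ("EGFR", "Disease ID#1", "Name", "Epidermal sluffing"),
    ("AHCY", "Description", None, "S-adenosylhomocysteine hydrolase"),
    ("AHCY", "Article ID#3", "Title",
     "Limited proteolysis of S-adenosylhomocysteine hydrolase: "
     "implications for the three-dimensional structure"),
    ("AHCY", "Article ID#2", "Title", ARTICLE_2),
]


def search(term, rows):
    """Group every row whose text contains the term under its gene symbol."""
    needle = term.lower()
    hits = defaultdict(list)
    for symbol, source, field, text in rows:
        if needle in text.lower():
            hits[symbol].append((source, field, text))
    return dict(hits)


results = search("epidermal", rows)
for symbol, found in results.items():
    print(symbol)
    for source, field, text in found:
        where = source if field is None else f"{source}, in {field}"
        print(f"  Found in {where} ({text})")
```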

Obviously, we'd like to rank hits by the field's location in the hierarchy (a
term in a gene name scores higher than a term in the name of an author of an
article related to the gene) and by number of hits (the number of times a term
is found in relation to that gene: 3 in the case of EGFR above).
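
One simple way to combine the two criteria, sketched in Python, is to give
each field a weight and sum the weights over all hits for a gene. The weight
values, the field labels and the rank() function below are made up for
illustration; they are not a recommendation or an existing scoring scheme.

```python
# Hypothetical weights: fields closer to the gene itself count for more.
FIELD_WEIGHTS = {
    "Gene.Symbol": 4.0,
    "Gene.Description": 3.0,
    "Disease.Name": 2.0,
    "Article.Title": 2.0,
    "Author.Name": 1.0,
}

# Hits for "epidermal" from the example result list: (gene symbol, field).
hits = [
    ("EGFR", "Gene.Description"),
    ("EGFR", "Article.Title"),
    ("EGFR", "Disease.Name"),
    ("AHCY", "Article.Title"),
]


def rank(hits, weights):
    """Sum field weights per gene, then sort genes by descending score."""
    scores = {}
    for symbol, field in hits:
        scores[symbol] = scores.get(symbol, 0.0) + weights[field]
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranking = rank(hits, FIELD_WEIGHTS)
for symbol, score in ranking:
    print(symbol, score)
```

Because the score is a sum, more hits raise it (EGFR has three, AHCY one),
while the weights let a single hit high in the hierarchy outrank several
hits far from the gene.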

Ideas for how to take on this challenge? Implementation? Tools? 

Thanks!
Yaron Golan

-- 
View this message in context: http://www.nabble.com/Full-Text-Search-in-a-Relational-Model-tp15063631p15063631.html
Sent from the Lucene - General mailing list archive at Nabble.com.

