From: yarongolan
To: general@lucene.apache.org
Date: Thu, 24 Jan 2008 04:29:34 -0800 (PST)
Subject: Full-Text Search in a Relational Model
Message-ID: <15063631.post@talk.nabble.com>

Hi, (Warning,
not for the weak-hearted)

I'm currently working on a project with a large and complex data model related to genomics. We are trying to build a search engine that provides "full text" and "field-based text" searches for our customer base (mostly academic research), and we are evaluating different tools for this purpose.

As a starting point, we have, as an example, a set of objects (stored in tables as a relational model):

  Gene [ID, Symbol, Description]
  Article - M:M with Gene [ID, Title]
  Disease - M:M with Gene [ID, Name]
  Author - M:M with Article [ID, Name]

(Note: the M:M tables exist; they just link IDs.)

An example model would be (hierarchical, with relations dealt with as duplications):

  Gene [ID=1, Symbol=EGFR, Description=epidermal growth factor receptor]
    Article [ID=1, Title=EGFR mutations in lung cancer: correlation with clinical response to gefitinib therapy]
      Author [ID=1, Name=H. Michaelson]
      Author [ID=2, Name=J. Watson]
    Article [ID=2, Title=Proteomics analysis of epidermal protein kinases by target class-selective prefractionation and tandem mass spectrometry]
      Author [ID=1, Name=H. Michaelson]
      Author [ID=3, Name=M. Roberts]
    Disease [ID=1, Name=Epidermal sluffing]
  Gene [ID=2, Symbol=AHCY, Description=S-adenosylhomocysteine hydrolase]
    Article [ID=3, Title=Limited proteolysis of S-adenosylhomocysteine hydrolase: implications for the three-dimensional structure]
      Author [ID=4, Name=B. Cohen]
      Author [ID=5, Name=L. Alexander]
    Article [ID=2, Title=Proteomics analysis of epidermal protein kinases by target class-selective prefractionation and tandem mass spectrometry]
      Author [ID=1, Name=H. Michaelson]
      Author [ID=3, Name=M. Roberts]

Note the IDs in the objects above, as they convey the relations in the hierarchical model.

In our full-text search, we would like to allow users to search ANY textual field for any string. For instance, a user could search for the term "epidermal", and we would display the list of genes that have any associated data containing that term (ranked, of course).
Our list of results would be something like:

  EGFR
    Found in Description (epidermal growth factor receptor)
    Found in Article ID#2, in Title (proteomics analysis of epidermal protein kinases by target class-selective prefractionation and tandem mass spectrometry)
    Found in Disease ID#1, in Name (Epidermal sluffing)
  AHCY
    Found in Article ID#2, in Title (proteomics analysis of epidermal protein kinases by target class-selective prefractionation and tandem mass spectrometry)

Note that the results retain a hierarchical view of our genes (being gene-centric, we're pretty much framing the question as "find this term in information related to those genes").

Also note that Article ID#2 has an M:M relation with both Gene ID 2 (AHCY) and Gene ID 1 (EGFR); only because of that is AHCY considered a gene that has "epidermal" in its annotations.

Obviously, we'd like to rank by a field's location in the hierarchy (a term in a gene name is scored higher than the name of the author of an article related to a gene) and by number of hits (the number of times a term is found related to that gene, 3 in the case of EGFR above).

Ideas for how to take on this challenge? Implementation? Tools?

Thanks!
Yaron Golan
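
To make the desired behaviour concrete, here is a minimal sketch in plain Python of the gene-centric search and ranking described above. It is not tied to Lucene or any other search library. It flattens everything reachable from a gene into (field, label, text) entries and scores each gene by summing a per-field weight for every entry that matches. The weight values, the FIELD_WEIGHTS table, and the function names gene_fields and search are all hypothetical, chosen only for illustration; the post only says that gene-level fields should outrank author names.

```python
import re

# Sample data from the post, stored as flat relational tables.
genes = {
    1: ("EGFR", "epidermal growth factor receptor"),
    2: ("AHCY", "S-adenosylhomocysteine hydrolase"),
}
articles = {
    1: "EGFR mutations in lung cancer: correlation with clinical response "
       "to gefitinib therapy",
    2: "Proteomics analysis of epidermal protein kinases by target "
       "class-selective prefractionation and tandem mass spectrometry",
    3: "Limited proteolysis of S-adenosylhomocysteine hydrolase: "
       "implications for the three-dimensional structure",
}
diseases = {1: "Epidermal sluffing"}
authors = {1: "H. Michaelson", 2: "J. Watson", 3: "M. Roberts",
           4: "B. Cohen", 5: "L. Alexander"}

# M:M link tables (pairs of IDs).
gene_article = [(1, 1), (1, 2), (2, 3), (2, 2)]
gene_disease = [(1, 1)]
article_author = [(1, 1), (1, 2), (2, 1), (2, 3), (3, 4), (3, 5)]

# Hypothetical weights: fields closer to the gene score higher.
FIELD_WEIGHTS = {
    "Gene.Symbol": 4.0,
    "Gene.Description": 3.0,
    "Article.Title": 2.0,
    "Disease.Name": 2.0,
    "Author.Name": 1.0,
}


def gene_fields(gene_id):
    """Yield (field, label, text) for every text value reachable from a gene."""
    symbol, description = genes[gene_id]
    yield "Gene.Symbol", "Symbol", symbol
    yield "Gene.Description", "Description", description
    for g, a in gene_article:
        if g != gene_id:
            continue
        yield "Article.Title", f"Article ID#{a}, in Title", articles[a]
        for art, au in article_author:
            if art == a:
                label = f"Author ID#{au} of Article ID#{a}, in Name"
                yield "Author.Name", label, authors[au]
    for g, d in gene_disease:
        if g == gene_id:
            yield "Disease.Name", f"Disease ID#{d}, in Name", diseases[d]


def search(term):
    """Return [(score, symbol, [(label, text), ...])], best score first."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    results = []
    for gene_id, (symbol, _) in genes.items():
        hits = [(f, label, text) for f, label, text in gene_fields(gene_id)
                if pattern.search(text)]
        if hits:
            # Each hit adds its field weight, so more hits also rank higher.
            score = sum(FIELD_WEIGHTS[f] for f, _, _ in hits)
            results.append((score, symbol, [(l, t) for _, l, t in hits]))
    results.sort(key=lambda r: (-r[0], r[1]))
    return results


if __name__ == "__main__":
    for score, symbol, hits in search("epidermal"):
        print(symbol, score)
        for label, text in hits:
            print(f"  Found in {label} ({text})")
```

With the sample data, a search for "epidermal" lists EGFR (three hits) ahead of AHCY (one hit, reached only through the shared Article ID#2). In a real search engine the same idea would probably become one indexed document per gene, with the related article, author, and disease text denormalized into separate fields, but that mapping is an assumption and not something the sketch demonstrates.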