Message-ID: <3ca19aa40801272357g46aa3379v91f867c61b2454f3@mail.gmail.com>
Date: Mon, 28 Jan 2008 13:27:58 +0530
From: "Srikant Jakilinki"
To: general@lucene.apache.org
Subject: Re: Full-Text Search in a Relational Model
In-Reply-To: <15063631.post@talk.nabble.com>

My first impression is that you need a proper DB and a search on top of it
(but not using the DB/SQL). Perhaps you could try these -

1) http://www.opensymphony.com/compass/content/about.html
2) http://kasparov.skife.org/blog/2004/09/11/#lucene-ojb
3) http://www.dbsight.net/

Please let us know if you find any other useful information in your search.

- SJ

On Jan 24, 2008 5:59 PM, yarongolan wrote:
>
> Hi,
>
> (Warning, not for the weak-hearted)
>
> I'm currently working on a project where we have a large and complex data
> model, related to Genomics. We are trying to build a search engine that
> provides "full text" and "field-based text" searches for our customer base
> (mostly academic research), and are evaluating different tools for this
> purpose.
>
> As a starting point, we have, as an example, a set of objects (stored in
> tables as a relational model):
> Gene [ID, Symbol, Description]
> Article - M:M with Gene [ID, Title]
> Disease - M:M with Gene [ID, Name]
> Author - M:M with Article [ID, Name]
> (Note: M:M tables exist, just link IDs)
>
> An example model would be (hierarchical, relations dealt with as
> duplications)
>
> Gene [ID=1, Symbol=EGFR, Description=epidermal growth factor receptor]
>   Article [ID=1, Title=EGFR mutations in lung cancer: correlation with
> clinical response to gefitinib therapy]
>     Author [ID=1, Name=H. Michaelson]
>     Author [ID=2, Name=J. Watson]
>   Article [ID=2, Title=Proteomics analysis of epidermal protein kinases by
> target class-selective prefractionation and tandem mass spectrometry]
>     Author [ID=1, Name=H. Michaelson]
>     Author [ID=3, Name=M. Roberts]
>   Disease [ID=1, Name=Epidermal sluffing]
>
> Gene [ID=2, Symbol=AHCY, Description=S-adenosylhomocysteine hydrolase]
>   Article [ID=3, Title=Limited proteolysis of S-adenosylhomocysteine
> hydrolase: implications for the three-dimensional structure]
>     Author [ID=4, Name=B. Cohen]
>     Author [ID=5, Name=L. Alexander]
>   Article [ID=2, Title=Proteomics analysis of epidermal protein kinases by
> target class-selective prefractionation and tandem mass spectrometry]
>     Author [ID=1, Name=H. Michaelson]
>     Author [ID=3, Name=M. Roberts]
>
> Note IDs in the objects above, as they relay the relations in the
> hierarchical model.
>
> In our Full-Text search, we would like to allow users to search ANY textual
> field for any string. For instance, the term "epidermal", and display the
> list of genes which have any data associated with them with that term
> (ranked, of course).
> Our list of results would be something like:
>
> EGFR
>   Found in Description (epidermal growth factor receptor)
>   Found in Article ID#2, in Title (proteomics analysis of epidermal protein
> kinases by target class-selective prefractionation and tandem mass
> spectrometry)
>   Found in Disease ID#1, in Name (Epidermal sluffing)
>
> AHCY
>   Found in Article ID#2, in Title (proteomics analysis of epidermal protein
> kinases by target class-selective prefractionation and tandem mass
> spectrometry)
>
> Note that the results retain a hierarchical view of our Genes (us being
> Gene-Centric, we're pretty much framing the question "find this term in
> information related to those genes"). Also note that Article ID #2 has an
> M:M with Gene ID2 (AHCY) and Gene ID1 (EGFR), and only due to that fact,
> AHCY is considered a gene that has "epidermal" in its annotations.
>
> Obviously, we'd like to rank fields by location in hierarchy (A term in a
> gene name is scored higher than the name of the author of an article related
> to a gene) and by number of hits (number of times a term is found related to
> that gene, 3 in the case of EGFR above).
>
> Ideas for how to take on this challenge? Implementation? Tools?
>
> Thanks!
> Yaron Golan
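
To make the gene-centric ranking described above concrete, here is a minimal
sketch in plain Python (no Lucene or database involved). It flattens the
example data so that every text field reachable from a gene is attributed to
that gene, then scores each gene by summing per-field weights over its hits.
The weight values, the table and function names (WEIGHTS, flatten, search),
and the case-insensitive substring matching are all assumptions made for
illustration; none of them come from the original message or from any
library.

```python
from collections import defaultdict

# In-memory copy of the example data from the message (IDs as given there).
GENES = {1: ("EGFR", "epidermal growth factor receptor"),
         2: ("AHCY", "S-adenosylhomocysteine hydrolase")}
ARTICLES = {
    1: "EGFR mutations in lung cancer: correlation with clinical response "
       "to gefitinib therapy",
    2: "Proteomics analysis of epidermal protein kinases by target "
       "class-selective prefractionation and tandem mass spectrometry",
    3: "Limited proteolysis of S-adenosylhomocysteine hydrolase: "
       "implications for the three-dimensional structure",
}
DISEASES = {1: "Epidermal sluffing"}
AUTHORS = {1: "H. Michaelson", 2: "J. Watson", 3: "M. Roberts",
           4: "B. Cohen", 5: "L. Alexander"}

# M:M link tables, as (left ID, right ID) pairs.
GENE_ARTICLE = [(1, 1), (1, 2), (2, 3), (2, 2)]
GENE_DISEASE = [(1, 1)]
ARTICLE_AUTHOR = [(1, 1), (1, 2), (2, 1), (2, 3), (3, 4), (3, 5)]

# Hypothetical weights: fields closer to the gene count for more.
WEIGHTS = {"Gene.Symbol": 8.0, "Gene.Description": 4.0,
           "Article.Title": 2.0, "Disease.Name": 2.0, "Author.Name": 1.0}


def flatten():
    """Yield (gene_id, field, label, text) for each field reachable from a gene."""
    for gid, (symbol, description) in GENES.items():
        yield gid, "Gene.Symbol", "Symbol", symbol
        yield gid, "Gene.Description", "Description", description
    for gid, aid in GENE_ARTICLE:
        yield gid, "Article.Title", f"Article ID#{aid}, in Title", ARTICLES[aid]
        for article_id, author_id in ARTICLE_AUTHOR:
            if article_id == aid:
                label = f"Author ID#{author_id} (Article ID#{aid}), in Name"
                yield gid, "Author.Name", label, AUTHORS[author_id]
    for gid, did in GENE_DISEASE:
        yield gid, "Disease.Name", f"Disease ID#{did}, in Name", DISEASES[did]


def search(term):
    """Return genes matching `term`, best first, with the fields that matched."""
    needle = term.lower()
    found = defaultdict(list)
    for gid, field, label, text in flatten():
        if needle in text.lower():  # naive substring match, one hit per field
            found[gid].append((field, label, text))
    results = [{
        "symbol": GENES[gid][0],
        "score": sum(WEIGHTS[field] for field, _, _ in matches),
        "hits": len(matches),
        "matches": [(label, text) for _, label, text in matches],
    } for gid, matches in found.items()]
    # Rank by weighted score, then by number of hits, then by symbol.
    results.sort(key=lambda r: (-r["score"], -r["hits"], r["symbol"]))
    return results


if __name__ == "__main__":
    for result in search("epidermal"):
        print(result["symbol"], result["score"], result["hits"])
        for label, text in result["matches"]:
            print(f"  Found in {label} ({text})")
```

The same flattening idea could plausibly carry over to a search index by
building one indexed document per gene, with one field per reachable
attribute and a per-field boost in place of WEIGHTS. That mapping is a
suggestion only and is not something the tools linked in the reply are
confirmed to do.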