Message-ID: <6e3ae6310610081233x107ce8eeh50db5fef97ab7e40@mail.gmail.com>
Date: Sun, 8 Oct 2006 12:33:27 -0700
From: "Chris Lu"
To: java-user@lucene.apache.org
Subject: Re: lucene link database
In-Reply-To: <4529320B.2060508@gmail.com>
References: <4528DFF7.6080500@gmail.com> <359a92830610080722nb2dbe45xc03ad3ed7ba7da5c@mail.gmail.com> <4529320B.2060508@gmail.com>

Like Erick said, one Lucene Document usually doesn't correspond to one
table entry. You need to flatten the database object into a Lucene Document.
You can write your code in Hibernate and use Compass to store data into
Lucene. If your code is already finished, or you want a scalable solution,
DBSight can help you traverse the object graph and flatten the objects
into Lucene Documents.

Or, if you have difficulty flattening the object, maybe the best way for
you is to search the Lucene index twice, or once in the DB and once in
Lucene? You can use something like "node_id:123 OR node_id:234 OR ..."

Chris Lu
------------------------------
Instant Lucene Search on Any Database/Application
http://www.dbsight.net

On 10/8/06, Cam Bazz wrote:
> Dear Erick;
>
> Thank you for your detailed insight. I have been trying to code a graph
> object database for some time.
> I have prototyped on relational as well as object-oriented databases,
> including open-source and commercial implementations
> (so far, I have tried Hibernate, Objectivity/DB, db4o). While object
> databases excel at traversing links, they are poor at searching.
>
> Lucene so far solves the problem of searching. I am thinking of a document
> as a list of tuples (a sequence of fields), and I can do searches with
> Lucene; it is really nice.
>
> Now I have to solve the problem of linking.
> If I keep the nodes in a Lucene index, and I can fetch documents with a
> doc_id or some sort of surrogate identifier, and use those identifiers as
> node_id in an object graph, that will be what I want. But in order to do
> that I need to be able to query the Lucene index by document_id.
>
> I was referring to the link db of Nutch. They do have some sort of
> link db implementation that runs with Hadoop, but I have not understood
> the full code.
> I am trying to understand the structure of this link database. I was
> thinking of using documents with src and dst fields that have document
> ids as values. (One idea; I will try it tomorrow.)
>
> Again, thanks a bunch.
>
> Best Regards,
> C.B.
>
> Erick Erickson wrote:
> > Approach it in whatever way you want as long as it solves your problem.
> >
> > My first question is why use Lucene? Would a database suit your needs
> > better? Of course, I can't say. Lucene shines at full-text searching, so
> > it's a closer call if you aren't searching on parts of text. By that I
> > mean that if you're not searching on *parts* of your links, you may want
> > to consider a DB solution.
> >
> > That said, and if I understand your requirement, you have a pretty simple
> > design. Each document has two fields, incominglinks and outgoinglinks.
> > But see the note below. Lucene indexes what you give it, so the fact that
> > some of the links aren't hypertext links is immaterial to Lucene. Since
> > you control both the indexer and searcher, these conform to whatever your
> > requirements are. It's up to you to map semantics onto these entities.
> >
> > One common trap DB-savvy people fall into is thinking of documents as
> > entries in a table, all with the same fields. There is nothing requiring
> > you to have the *same* fields in each document in an index. You could
> > have an index in which no two documents shared *any* common field if you
> > choose.
> >
> > So, if you want to find out, say, which documents have link X as an
> > incoming link, just search on incominglinks:X. If you wanted to find the
> > documents that had any incoming links X, Y, Z that matched an outgoing
> > link in another document, just search the OR of these in outgoinglinks.
> >
> > If you want some kind of map of the whole web of links, you'll have to
> > write some iterative loop and keep track. There's nothing built in that
> > I know of that lets you answer "Given link X, show me all the documents
> > no more than 3 hops away". Lucene is an *engine*, designed to have apps
> > built on top of it. Lucene doesn't deal with relations between documents,
> > just searching what you've indexed.
> >
> > It's easy enough to store a variable number of links in your
> > incominglinks or outgoinglinks field. Just be sure they're tokenized
> > appropriately. You can add them any way you choose: either concatenate
> > them all into a big string and index that, or index them into the same
> > field, e.g.
> > // Lucene 2.0-era API: Field(name, value, store, index)
> > Document doc = new Document();
> > doc.add(new Field("incoming", "link1", Field.Store.YES, Field.Index.TOKENIZED));
> > doc.add(new Field("incoming", "link2", Field.Store.YES, Field.Index.TOKENIZED));
> > .
> > .
> > .
> > writer.addDocument(doc);
> >
> > According to a discussion from a while ago, this is the same as
> > doc.add(new Field("incoming", "link1 link2", Field.Store.YES, Field.Index.TOKENIZED));
> > in terms of how it all gets handled internally.
> >
> >
> > NOTE: I'm skipping most of the question of which Analyzer you use.
> > This will almost surely trip you up sometime. I'd suggest starting with
> > WhitespaceAnalyzer as that's more intuitive. Some of the other analyzers
> > will break your links up in ways you don't expect. Really, really, really
> > get a copy of Luke to see what's actually *in* your index and how
> > searches work, and how the analyzer you choose changes what's searched
> > for, as well as what's indexed. Google lucene luke and you'll find it.
> >
> > Anyway, hope this all helps.
> > Erick
> >
> > On 10/8/06, Cam Bazz wrote:
> >>
> >> Hello,
> >>
> >> I would like to make a link database using Lucene, similar to the one
> >> that Nutch uses. I have read the basic documentation and understood how
> >> document indexing, search, and scoring work. But what I would like is
> >> for different documents to have different kinds of links (semantic
> >> links) to each other. I would like to be able to search in the database
> >> like incominglinksofdocument(id), outgoinglinksofdocument(id). The links
> >> I am talking about might not necessarily be hypertext links.
> >>
> >> How would I approach a problem like this?
> >>
> >> Best Regards,
> >> -C.B.
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: java-user-help@lucene.apache.org
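
Two ideas from this thread fit together: the "node_id:123 OR node_id:234 OR ..." style of query, and the "iterative loop" needed to answer "show me everything no more than 3 hops away" when links are stored as documents with src and dst fields. The following is only a minimal sketch in plain Java, not Lucene code. The class LinkHops and the methods orQuery, searchDst, and within are hypothetical names, and an in-memory list of {src, dst} pairs stands in for the index so that the loop can be run on its own.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.StringJoiner;

/** Hypothetical sketch: hop-limited traversal over link documents with src/dst fields. */
final class LinkHops {

    /** Builds a query string of the form "field:id1 OR field:id2 OR ...". */
    static String orQuery(String field, Collection<String> ids) {
        StringJoiner query = new StringJoiner(" OR ");
        for (String id : ids) {
            query.add(field + ":" + id);
        }
        return query.toString();
    }

    /**
     * Stand-in for one index search (assumption: no real index here).
     * Returns the dst of every link whose src is in srcIds.
     * Each link is a two-element array: {src, dst}.
     */
    static Set<String> searchDst(List<String[]> links, Set<String> srcIds) {
        Set<String> result = new LinkedHashSet<>();
        for (String[] link : links) {
            if (srcIds.contains(link[0])) {
                result.add(link[1]);
            }
        }
        return result;
    }

    /** Returns all nodes reachable from start in at most maxHops outgoing links. */
    static Set<String> within(List<String[]> links, String start, int maxHops) {
        Set<String> seen = new LinkedHashSet<>();
        seen.add(start);
        Set<String> frontier = new LinkedHashSet<>();
        frontier.add(start);
        for (int hop = 0; hop < maxHops && !frontier.isEmpty(); hop++) {
            // Against a real index, this step would be one search for
            // orQuery("src", frontier), reading the dst field of each hit.
            Set<String> next = searchDst(links, frontier);
            next.removeAll(seen);   // keep track, so cycles do not loop forever
            seen.addAll(next);
            frontier = next;
        }
        seen.remove(start);
        return seen;
    }

    public static void main(String[] args) {
        List<String[]> links = new ArrayList<>();
        links.add(new String[] {"a", "b"});
        links.add(new String[] {"b", "c"});
        links.add(new String[] {"b", "a"});  // a cycle back to the start
        links.add(new String[] {"c", "d"});
        links.add(new String[] {"d", "e"});

        System.out.println(orQuery("node_id", Arrays.asList("123", "234")));
        System.out.println(within(links, "a", 3));  // prints [b, c, d]
    }
}
```

One search per hop keeps the number of round trips equal to the hop limit rather than the number of nodes. With a real index, a large frontier turns into a large OR query, and Lucene's BooleanQuery has a maximum clause count (1024 by default), so a wide graph would probably need the frontier split into batches.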