Return-Path: Delivered-To: apmail-lucene-dev-archive@www.apache.org Received: (qmail 30685 invoked from network); 8 Nov 2010 19:52:54 -0000 Received: from unknown (HELO mail.apache.org) (140.211.11.3) by 140.211.11.9 with SMTP; 8 Nov 2010 19:52:54 -0000 Received: (qmail 65203 invoked by uid 500); 8 Nov 2010 19:53:24 -0000 Delivered-To: apmail-lucene-dev-archive@lucene.apache.org Received: (qmail 65153 invoked by uid 500); 8 Nov 2010 19:53:23 -0000 Mailing-List: contact dev-help@lucene.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@lucene.apache.org Delivered-To: mailing list dev@lucene.apache.org Received: (qmail 65146 invoked by uid 99); 8 Nov 2010 19:53:23 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 08 Nov 2010 19:53:23 +0000 X-ASF-Spam-Status: No, hits=0.7 required=10.0 tests=FREEMAIL_FROM,SPF_NEUTRAL,T_TO_NO_BRKTS_FREEMAIL X-Spam-Check-By: apache.org Received-SPF: neutral (nike.apache.org: local policy) Received: from [77.238.189.220] (HELO nm6-vm1.bullet.mail.ird.yahoo.com) (77.238.189.220) by apache.org (qpsmtpd/0.29) with SMTP; Mon, 08 Nov 2010 19:53:15 +0000 Received: from [77.238.189.55] by nm6.bullet.mail.ird.yahoo.com with NNFMP; 08 Nov 2010 19:52:54 -0000 Received: from [212.82.108.247] by tm8.bullet.mail.ird.yahoo.com with NNFMP; 08 Nov 2010 19:52:54 -0000 Received: from [127.0.0.1] by omp1012.mail.ird.yahoo.com with NNFMP; 08 Nov 2010 19:52:54 -0000 X-Yahoo-Newman-Property: ymail-5 X-Yahoo-Newman-Id: 102584.47514.bm@omp1012.mail.ird.yahoo.com Received: (qmail 865 invoked by uid 60001); 8 Nov 2010 19:52:53 -0000 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=yahoo.co.uk; s=s1024; t=1289245973; bh=0Jowaj4tR8YbNCeLwvLO1YqWFlhT+7Wd1gw1kV3Gh6s=; h=Message-ID:X-YMail-OSG:Received:X-Mailer:References:Date:From:Subject:To:In-Reply-To:MIME-Version:Content-Type:Content-Transfer-Encoding; b=vOPJsJKTlcOXr6176f1WZH0n5JMFmxztLPL3okTLaJqXhr0t7nPQcb7diNk5OpI8XOprI+JD8pFr1LHQuLyBbsVX8Et+jOVoh661KmmWVIw+s6FSDh8FRNHX1SRsO4JollWStwjfxNPqAx+MJudAYV8/e8GN4HsvHSeXh/FeYyI= DomainKey-Signature: a=rsa-sha1; q=dns; c=nofws; s=s1024; d=yahoo.co.uk; h=Message-ID:X-YMail-OSG:Received:X-Mailer:References:Date:From:Subject:To:In-Reply-To:MIME-Version:Content-Type:Content-Transfer-Encoding; b=vbfb+F9NuZdMi1Y6+6Ug411vCoHH/XC3LGcpd7NFNV4is6Qt4hWBcui1AP30A4Wmr/n3ZXQYjJHnZmGbvn+uZ2ra0FOLNCaH5SQvU0PoF4NFhfr9mwCriKDCh+9mCfXsbYv+ue6Pw2ufEOvO9Lcjfzt+JKcXeK5Wa5gslUQDrKE=; Message-ID: <741272.848.qm@web29001.mail.ird.yahoo.com> X-YMail-OSG: 659V.fsVM1nqm_ZlzoS4KgnVH7PeF.F_cHhwcIxFCvegHPp FVFah3AhgeOe0QvZThJD34O7aDGUzV4sMNIpl5pQDdOMEUegrez8YB0M0HoB k255o.7h_WfMgZBgIVU_ZGCbzfoJg8ey55yFlytNtfQyPxzPfuFtk5HPB66E RnvAmq1AecZi8AFJ2SULxxmZGQWYkvozXfY81vS4seImo1RFzHg1IYqixAXA dqdcFn1lXlSZpBSOGWTKZ9Q_Lj1nW5A8g6uxzpxVKxLiSIxsqe80sOd3EjKQ pgpJguSTOAVOtSWywJZ5onLlmbaz7wi.P5wl72kTQ1KfWc5IFKLhV0esYwN5 9anE- Received: from [194.106.34.5] by web29001.mail.ird.yahoo.com via HTTP; Mon, 08 Nov 2010 19:52:53 GMT X-Mailer: YahooMailRC/504.5 YahooMailWebService/0.8.107.284920 References: <835116.95980.qm@web29009.mail.ird.yahoo.com> <201009251241.26379.paul.elschot@xs4all.nl> <0AC17B7C-0E40-42AA-8269-213D0730EBE1@yahoo.co.uk> <201009252303.28562.paul.elschot@xs4all.nl> <52407.98586.qm@web29010.mail.ird.yahoo.com> Date: Mon, 8 Nov 2010 19:52:53 +0000 (GMT) From: mark harwood Subject: Re: Document links To: dev@lucene.apache.org In-Reply-To: MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable X-Virus-Checked: Checked by ClamAV on apache.org I came to the conclusion that the transient meaning of document ids is too = =0Adeeply ingrained in Lucene's design to use them to underpin any reliable= =0Alinking.=0AWhile it might work for relatively static indexes, any index= with a reasonable =0Anumber of updates or deletes will invalidate any stor= ed document references in =0Aways which are very hard to track. Lucene's co= mpaction shuffles IDs without =0Ataking care to preserve identity, unlike g= raph DBs like Neo4j (see "recycling =0AIDs" here: http://goo.gl/5UbJi )=0A= =0A=0ACheers,=0AMark=0A=0A=0A----- Original Message ----=0AFrom: Ryan McKin= ley =0ATo: dev@lucene.apache.org=0ASent: Mon, 8 November= , 2010 19:03:59=0ASubject: Re: Document links=0A=0AAny updates/progress wit= h this?=0A=0AI'm looking at ways to implement an RTree with lucene -- and t= his=0Adiscussion seems relevant=0A=0Athanks=0Aryan=0A=0A=0AOn Sat, Sep 25, = 2010 at 5:42 PM, mark harwood wrote:=0A>>>Both th= ese on disk data structures and the ones in a B+ tree have seek =0Aoffsets= =0A>>>into files=0A>>>that require disk seeks. And both could use document = ids as key values.=0A>=0A> Yep. However my approach doesn't use a doc id as= a key that is searched in any=0A> B+ tree index (which involves disk seeks= ) - it is used as direct offset into a=0A> file to get the pointer into a "= links" data structure.=0A>=0A>=0A>=0A>>>But do these disk data structures s= upport dynamic addition and deletion of=0A>>>(larger=0A>>>numbers of) docum= ent links?=0A>=0A> Yes, the slide deck I linked to shows how links (like do= cuments) spend the =0A>early=0A> stages of life being merged frequently in = the smaller, newer segments and over=0A> time migrate into larger, more sta= ble segments as part of Lucene transactions.=0A>=0A> That's the theory - I'= m currently benchmarking an early prototype.=0A>=0A>=0A>=0A> ----- Original= Message ----=0A> From: Paul Elschot =0A> To: dev@l= ucene.apache.org=0A> Sent: Sat, 25 September, 2010 22:03:28=0A> Subject: Re= : Document links=0A>=0A> Op zaterdag 25 september 2010 15:23:39 schreef Mar= k Harwood:=0A>> My starting point in the solution I propose was to eliminat= e linking via any=0A>>type of key. Key lookups mean indexes and indexes mea= n disk seeks. Graph=0A>>traversals have exponential numbers of links and so= all these index disk seeks=0A>>start to stack up. The solution I propose u= ses doc ids as more-or-less direct=0A>>pointers into file structures avoidi= ng any index lookup.=0A>> I've started coding up some tests using the file = structures I outlined and =0A>will=0A>>compare that with a traditional key-= based approach.=0A>=0A> Both these on disk data structures and the ones in = a B+ tree have seek offsets=0A> into files=0A> that require disk seeks. And= both could use document ids as key values.=0A>=0A> But do these disk data = structures support dynamic addition and deletion of=0A> (larger=0A> numbers= of) document links?=0A>=0A> B+ trees are a standard solution for problems = like this one, and it would=0A> probably=0A> not be easy to outperform them= .=0A> It may be possible to improve performance of B+ trees somewhat by spe= cializing=0A> for the fairly simple keys that would be needed, and by encod= ing very short=0A> lists of links=0A> for a single document directly into a= seek offset to avoid the actual seek, =0Abut=0A> that's=0A> about it.=0A>= =0A> Regards,=0A> Paul Elschot=0A>=0A>>=0A>> For reference - playing the "K= evin Bacon game" on a traditional Lucene index =0A>of=0A>>IMDB data took 18= seconds to find a short path that Neo4j finds in 200=0A>>milliseconds on t= he same data (and this was a disk based graph of 3m nodes, =0A10m=0A>>edges= ).=0A>> Going from actor->movies->actors->movies produces a lot of key look= ups and =0Athe=0A>>difference between key indexes and direct node pointers = becomes clear.=0A>> I know path finding analysis is perhaps not a typical L= ucene application but=0A>>other forms of link analysis e.g. recommendation = engines require similar=0A>>performance.=0A>>=0A>> Cheers=0A>> Mark=0A>>=0A= >>=0A>>=0A>> On 25 Sep 2010, at 11:41, Paul Elschot wrote:=0A>>=0A>> > Op v= rijdag 24 september 2010 17:57:45 schreef mark harwood:=0A>> >>>> While not= exactly equivalent, it reminds me of our earlier discussion=0A>>around=0A>= >=0A>> >>>> "layered segments" for dealing with field updates=0A>> >>=0A>> = >> Right. Fast discovery of document relations is a foundation on which lot= s =0A>of=0A>>=0A>> >> things like this can build. Relations can be given ty= pes to support a =0A>number=0A>>of=0A>>=0A>> >> different use cases.=0A>> >= =0A>> > How about using this (bsd licenced) tree as a starting point:=0A>> = > http://bplusdotnet.sourceforge.net/=0A>> > It has various keys: ao. byte = array, String and long.=0A>> >=0A>> > A fixed size byte array as key seems = to be just fine: two bytes for a field=0A>>number,=0A>> > four for the segm= ent number and four for the in-segment document id.=0A>> > The separate seg= ment number would allow to minimize the updates=0A>> > in the tree during m= erges. One could also use the normal doc id directly.=0A>> >=0A>> > The val= ue could then be a similar to the key, but without=0A>> > the field number,= and with an indication of the direction of the link.=0A>> > Or perhaps the= direction of the link should be added to the key.=0A>> > A link would be p= resent twice, once for each direction.=0A>> > Also both directions could ha= ve their own payloads.=0A>> >=0A>> > It could be put in its own file as a s= eparate 'segment', or maybe=0A>> > each segment could allow for allocation = of a part of this tree.=0A>> >=0A>> > I like this somehow, in case it is do= ne right one might never=0A>> > need a relational database again. Well, alm= ost...=0A>> >=0A>> > Regards,=0A>> > Paul Elschot=0A>> >=0A>> >=0A>> >>=0A>= > >>=0A>> >>=0A>> >> ----- Original Message ----=0A>> >> From: Grant Ingers= oll =0A>> >> To: dev@lucene.apache.org=0A>> >> Sent: F= ri, 24 September, 2010 16:26:27=0A>> >> Subject: Re: Document links=0A>> >>= =0A>> >> While not exactly equivalent, it reminds me of our earlier discuss= ion =0A>around=0A>>=0A>> >> "layered segments" for dealing with field updat= es [1], [2], albeit this is =0A>a=0A>>bit=0A>>=0A>> >> more generic since o= ne could not only use the links for relating =0Adocuments,=0A>>but=0A>>=0A>= > >> one could use "special" links underneath the covers in Lucene to=0A>>m= aintain/mark=0A>>=0A>> >> which fields have been updated and then traverse = to them.=0A>> >>=0A>> >> [1]=0A>> >>=0A>>http://www.lucidimagination.com/se= arch/document/c871ea4672dda844/aw_incremental_field_updates#7ef11a70cdc9538= 4=0A>>=0A>>=0A>> >>=0A>> >> [2]=0A>> >>=0A>>http://www.lucidimagination.com= /search/document/ee102692c8023548/incremental_field_updates#13ffdd50440cce6= e=0A>>=0A>>=0A>> >>=0A>> >>=0A>> >> On Sep 24, 2010, at 10:36 AM, mark harw= ood wrote:=0A>> >>=0A>> >>> This slideshow has a first-cut on the Lucene fi= le format extensions=0A>>required to=0A>>=0A>> >>>=0A>> >>> support fast li= nking between documents:=0A>> >>>=0A>> >>> http://www.slideshare.net/MarkHa= rwood/linking-lucene-documents=0A>> >>>=0A>> >>>=0A>> >>> Interested in any= of your thoughts.=0A>> >>>=0A>> >>> Cheers,=0A>> >>> Mark=0A>> >>>=0A>> >>= >=0A>> >>>=0A>> >>>=0A>> >>>=0A>> >>> -------------------------------------= --------------------------------=0A>> >>> To unsubscribe, e-mail: dev-unsub= scribe@lucene.apache.org=0A>> >>> For additional commands, e-mail: dev-help= @lucene.apache.org=0A>> >>>=0A>> >>=0A>> >> --------------------------=0A>>= >> Grant Ingersoll=0A>> >> http://lucenerevolution.org Apache Lucene/Solr = Conference, Boston Oct 7-8=0A>> >>=0A>> >>=0A>> >> ------------------------= ---------------------------------------------=0A>> >> To unsubscribe, e-mai= l: dev-unsubscribe@lucene.apache.org=0A>> >> For additional commands, e-mai= l: dev-help@lucene.apache.org=0A>> >>=0A>> >>=0A>> >>=0A>> >>=0A>> >> -----= ----------------------------------------------------------------=0A>> >> To= unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org=0A>> >> For additio= nal commands, e-mail: dev-help@lucene.apache.org=0A>> >>=0A>> >>=0A>> >>=0A= >> >=0A>> > ---------------------------------------------------------------= ------=0A>> > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org=0A>= > > For additional commands, e-mail: dev-help@lucene.apache.org=0A>> >=0A>>= =0A>>=0A>> ----------------------------------------------------------------= -----=0A>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org=0A>> F= or additional commands, e-mail: dev-help@lucene.apache.org=0A>>=0A>>=0A>>= =0A>=0A> ------------------------------------------------------------------= ---=0A> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org=0A> For a= dditional commands, e-mail: dev-help@lucene.apache.org=0A>=0A>=0A>=0A>=0A> = ---------------------------------------------------------------------=0A> T= o unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org=0A> For additional= commands, e-mail: dev-help@lucene.apache.org=0A>=0A>=0A=0A----------------= -----------------------------------------------------=0ATo unsubscribe, e-m= ail: dev-unsubscribe@lucene.apache.org=0AFor additional commands, e-mail: d= ev-help@lucene.apache.org=0A=0A=0A --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org For additional commands, e-mail: dev-help@lucene.apache.org