lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Mark Harwood <markharw...@yahoo.co.uk>
Subject Re: Document links
Date Sat, 25 Sep 2010 13:23:39 GMT
My starting point in the solution I propose was to eliminate linking via any type of key. Key
lookups mean indexes and indexes mean disk seeks. Graph traversals have exponential numbers
of links and so all these index disk seeks start to stack up. The solution I propose uses
doc ids as more-or-less direct pointers into file structures avoiding any index lookup.
I've started coding up some tests using the file structures I outlined and will compare that
with a traditional key-based approach.

For reference - playing the "Kevin Bacon game" on a traditional Lucene index of IMDB data
took 18 seconds to find a short path that Neo4j finds in 200 milliseconds on the same data
(and this was a disk based graph of 3m nodes, 10m edges).
Going from actor->movies->actors->movies produces a lot of key lookups and the difference
between key indexes and direct node pointers becomes clear.
I know path finding analysis is perhaps not a typical Lucene application but other forms of
link analysis e.g. recommendation engines require similar performance.

Cheers
Mark



On 25 Sep 2010, at 11:41, Paul Elschot wrote:

> Op vrijdag 24 september 2010 17:57:45 schreef mark harwood:
>>>> While not exactly equivalent, it reminds me of our earlier discussion around

>>>> "layered segments" for dealing with field updates
>> 
>> Right. Fast discovery of document relations is a foundation on which lots of 
>> things like this can build. Relations can be given types to support a number of 
>> different use cases.
> 
> How about using this (bsd licenced) tree as a starting point:
> http://bplusdotnet.sourceforge.net/
> It has various keys: ao. byte array, String and long.
> 
> A fixed size byte array as key seems to be just fine: two bytes for a field number,
> four for the segment number and four for the in-segment document id.
> The separate segment number would allow to minimize the updates
> in the tree during merges. One could also use the normal doc id directly.
> 
> The value could then be a similar to the key, but without
> the field number, and with an indication of the direction of the link.
> Or perhaps the direction of the link should be added to the key.
> A link would be present twice, once for each direction.
> Also both directions could have their own payloads.
> 
> It could be put in its own file as a separate 'segment', or maybe
> each segment could allow for allocation of a part of this tree.
> 
> I like this somehow, in case it is done right one might never
> need a relational database again. Well, almost...
> 
> Regards,
> Paul Elschot
> 
> 
>> 
>> 
>> 
>> ----- Original Message ----
>> From: Grant Ingersoll <gsingers@apache.org>
>> To: dev@lucene.apache.org
>> Sent: Fri, 24 September, 2010 16:26:27
>> Subject: Re: Document links
>> 
>> While not exactly equivalent, it reminds me of our earlier discussion around 
>> "layered segments" for dealing with field updates [1], [2], albeit this is a bit

>> more generic since one could not only use the links for relating documents, but 
>> one could use "special" links underneath the covers in Lucene to maintain/mark 
>> which fields have been updated and then traverse to them.
>> 
>> [1] 
>> http://www.lucidimagination.com/search/document/c871ea4672dda844/aw_incremental_field_updates#7ef11a70cdc95384
>> 
>> [2] 
>> http://www.lucidimagination.com/search/document/ee102692c8023548/incremental_field_updates#13ffdd50440cce6e
>> 
>> 
>> On Sep 24, 2010, at 10:36 AM, mark harwood wrote:
>> 
>>> This slideshow has a first-cut on the Lucene file format extensions required
to 
>>> 
>>> support fast linking between documents:
>>> 
>>> http://www.slideshare.net/MarkHarwood/linking-lucene-documents
>>> 
>>> 
>>> Interested in any of your thoughts.
>>> 
>>> Cheers,
>>> Mark
>>> 
>>> 
>>> 
>>> 
>>> 
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>> 
>> 
>> --------------------------
>> Grant Ingersoll
>> http://lucenerevolution.org Apache Lucene/Solr Conference, Boston Oct 7-8
>> 
>> 
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>> 
>> 
>> 
>> 
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>> 
>> 
>> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Mime
View raw message