lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ryan McKinley <ryan...@gmail.com>
Subject Re: Document links
Date Mon, 08 Nov 2010 19:03:59 GMT
Any updates/progress with this?

I'm looking at ways to implement an RTree with lucene -- and this
discussion seems relevant

thanks
ryan


On Sat, Sep 25, 2010 at 5:42 PM, mark harwood <markharw00d@yahoo.co.uk> wrote:
>>>Both these on disk data structures and the ones in a B+ tree have seek offsets
>>>into files
>>>that require disk seeks. And both could use document ids as key values.
>
> Yep. However my approach doesn't use a doc id as a key that is searched in any
> B+ tree index (which involves disk seeks) - it is used as direct offset into a
> file to get the pointer into a "links" data structure.
>
>
>
>>>But do these disk data structures support dynamic addition and deletion of
>>>(larger
>>>numbers of) document links?
>
> Yes, the slide deck I linked to shows how links (like documents) spend the early
> stages of life being merged frequently in the smaller, newer segments and over
> time migrate into larger, more stable segments as part of Lucene transactions.
>
> That's the theory - I'm currently benchmarking an early prototype.
>
>
>
> ----- Original Message ----
> From: Paul Elschot <paul.elschot@xs4all.nl>
> To: dev@lucene.apache.org
> Sent: Sat, 25 September, 2010 22:03:28
> Subject: Re: Document links
>
> Op zaterdag 25 september 2010 15:23:39 schreef Mark Harwood:
>> My starting point in the solution I propose was to eliminate linking via any
>>type of key. Key lookups mean indexes and indexes mean disk seeks. Graph
>>traversals have exponential numbers of links and so all these index disk seeks
>>start to stack up. The solution I propose uses doc ids as more-or-less direct
>>pointers into file structures avoiding any index lookup.
>> I've started coding up some tests using the file structures I outlined and will
>>compare that with a traditional key-based approach.
>
> Both these on disk data structures and the ones in a B+ tree have seek offsets
> into files
> that require disk seeks. And both could use document ids as key values.
>
> But do these disk data structures support dynamic addition and deletion of
> (larger
> numbers of) document links?
>
> B+ trees are a standard solution for problems like this one, and it would
> probably
> not be easy to outperform them.
> It may be possible to improve performance of B+ trees somewhat by specializing
> for the fairly simple keys that would be needed, and by encoding very short
> lists of links
> for a single document directly into a seek offset to avoid the actual seek, but
> that's
> about it.
>
> Regards,
> Paul Elschot
>
>>
>> For reference - playing the "Kevin Bacon game" on a traditional Lucene index of
>>IMDB data took 18 seconds to find a short path that Neo4j finds in 200
>>milliseconds on the same data (and this was a disk based graph of 3m nodes, 10m
>>edges).
>> Going from actor->movies->actors->movies produces a lot of key lookups and
the
>>difference between key indexes and direct node pointers becomes clear.
>> I know path finding analysis is perhaps not a typical Lucene application but
>>other forms of link analysis e.g. recommendation engines require similar
>>performance.
>>
>> Cheers
>> Mark
>>
>>
>>
>> On 25 Sep 2010, at 11:41, Paul Elschot wrote:
>>
>> > Op vrijdag 24 september 2010 17:57:45 schreef mark harwood:
>> >>>> While not exactly equivalent, it reminds me of our earlier discussion
>>around
>>
>> >>>> "layered segments" for dealing with field updates
>> >>
>> >> Right. Fast discovery of document relations is a foundation on which lots
of
>>
>> >> things like this can build. Relations can be given types to support a number
>>of
>>
>> >> different use cases.
>> >
>> > How about using this (bsd licenced) tree as a starting point:
>> > http://bplusdotnet.sourceforge.net/
>> > It has various keys: ao. byte array, String and long.
>> >
>> > A fixed size byte array as key seems to be just fine: two bytes for a field
>>number,
>> > four for the segment number and four for the in-segment document id.
>> > The separate segment number would allow to minimize the updates
>> > in the tree during merges. One could also use the normal doc id directly.
>> >
>> > The value could then be a similar to the key, but without
>> > the field number, and with an indication of the direction of the link.
>> > Or perhaps the direction of the link should be added to the key.
>> > A link would be present twice, once for each direction.
>> > Also both directions could have their own payloads.
>> >
>> > It could be put in its own file as a separate 'segment', or maybe
>> > each segment could allow for allocation of a part of this tree.
>> >
>> > I like this somehow, in case it is done right one might never
>> > need a relational database again. Well, almost...
>> >
>> > Regards,
>> > Paul Elschot
>> >
>> >
>> >>
>> >>
>> >>
>> >> ----- Original Message ----
>> >> From: Grant Ingersoll <gsingers@apache.org>
>> >> To: dev@lucene.apache.org
>> >> Sent: Fri, 24 September, 2010 16:26:27
>> >> Subject: Re: Document links
>> >>
>> >> While not exactly equivalent, it reminds me of our earlier discussion around
>>
>> >> "layered segments" for dealing with field updates [1], [2], albeit this
is a
>>bit
>>
>> >> more generic since one could not only use the links for relating documents,
>>but
>>
>> >> one could use "special" links underneath the covers in Lucene to
>>maintain/mark
>>
>> >> which fields have been updated and then traverse to them.
>> >>
>> >> [1]
>> >>
>>http://www.lucidimagination.com/search/document/c871ea4672dda844/aw_incremental_field_updates#7ef11a70cdc95384
>>
>> >>
>> >> [2]
>> >>
>>http://www.lucidimagination.com/search/document/ee102692c8023548/incremental_field_updates#13ffdd50440cce6e
>>
>> >>
>> >>
>> >> On Sep 24, 2010, at 10:36 AM, mark harwood wrote:
>> >>
>> >>> This slideshow has a first-cut on the Lucene file format extensions
>>required to
>>
>> >>>
>> >>> support fast linking between documents:
>> >>>
>> >>> http://www.slideshare.net/MarkHarwood/linking-lucene-documents
>> >>>
>> >>>
>> >>> Interested in any of your thoughts.
>> >>>
>> >>> Cheers,
>> >>> Mark
>> >>>
>> >>>
>> >>>
>> >>>
>> >>>
>> >>> ---------------------------------------------------------------------
>> >>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> >>> For additional commands, e-mail: dev-help@lucene.apache.org
>> >>>
>> >>
>> >> --------------------------
>> >> Grant Ingersoll
>> >> http://lucenerevolution.org Apache Lucene/Solr Conference, Boston Oct 7-8
>> >>
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> >> For additional commands, e-mail: dev-help@lucene.apache.org
>> >>
>> >>
>> >>
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> >> For additional commands, e-mail: dev-help@lucene.apache.org
>> >>
>> >>
>> >>
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> > For additional commands, e-mail: dev-help@lucene.apache.org
>> >
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>>
>>
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Mime
View raw message