lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "J. Delgado" <joaquin.delg...@gmail.com>
Subject Re: Proposal - a high performance Key-Value store based on Lucene APIs/concepts
Date Mon, 26 Mar 2012 05:53:42 GMT
Hi Mark,

I'm interested in what you have done in somewhat peculiar way:

Currently, we use fields and terms in Lucene as the basis for the inverted
index. However, as you can read in this paper for indexing Boolean
expressions : http://theory.stanford.edu/~sergei/papers/vldb09-indexing.pdf,
they create posting lists for all possible attribute name and value
pairs
(also called keys) among the conjunctions. A posting list head contains the
key (A, v). The keys of the posting lists are stored in a hash table, which
will be used to search posting lists given keys of an assignment.

So perhaps the ability to mix the two different forms of indexes by
building a posting list for each entry in your KVStore may help me design a
Lucene-based solution for this problem.

-- J


On Sat, Mar 24, 2012 at 4:50 PM, Lance Norskog <goksron@gmail.com> wrote:

> Cool!
>
> On Sat, Mar 24, 2012 at 4:17 PM, Mark Harwood <markharw00d@yahoo.co.uk>
> wrote:
> > OK I have some code and benchmarks for this solution up on a Google Code
> project here: http://code.google.com/p/graphdb-load-tester/
> >
> > The project exists to address the performance challenges I have
> encountered when dealing with large graphs. It  uses all of the Wikipedia
> links as a test dataset and a choice of graph databases (most of which use
> Lucene BTW).
> > The test data is essentially 130 million edges representing links
> between pages e.g.  Communism->Russia.
> > To load the data all of the graph databases have to translate
> user-defined keys like "Russia" into an internally-generated node ID using
> a service that looks like this:
> >        interface KeyService
> >        {
> >                //Returns existing nodeid or -1 if is not already in store
> >                public long getGraphNodeId(String udk);
> >
> >                //Adds a new record - assumption is client has checked
> user defined key (udk) is not stored already using getGraphNodeId
> >                public void put(String udk, long graphNodeId);
> >        }
> >
> > This is a challenge on a dataset of this size. I tried using a
> Lucene-based implementation for this service with the following
> optimisations:
> > 1) a Bloomfilter to quickly "know what we don't know"
> > 2) an LRUCache to hold on to commonly referenced vertices e.g the
> Wikipdedia article for "United States"
> > 3) a hashmap representing the unflushed state of Lucene's IndexWriter to
> avoid the need for excessive flushing with NRT reader etc
> >
> > The search/write performance showed the familiar saw-toothing as the
> Lucene index grew in size and merge operations kicked in.
> >
> > The KVStore implementation I wrote attempts to tackle this problem using
> a fundamentally different form of index. The results from the KVStore runs
> show it was twice as fast as this  Lucene solution and maintains constant
> performance without the saw toothing effect.
> >
> > Benchmark figures are here: http://goo.gl/VQ027
> > The KVStore source code is here: http://goo.gl/ovkop and the Lucene
> implementation I compare against is also in the project.
> >
> > Cheers
> > Mark
> >
> >
> >
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: dev-help@lucene.apache.org
> >
>
>
>
> --
> Lance Norskog
> goksron@gmail.com
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>

Mime
View raw message