lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Mark Harwood <>
Subject Re: Proposal - a high performance Key-Value store based on Lucene APIs/concepts
Date Sat, 24 Mar 2012 23:17:34 GMT
OK I have some code and benchmarks for this solution up on a Google Code project here:

The project exists to address the performance challenges I have encountered when dealing with
large graphs. It  uses all of the Wikipedia links as a test dataset and a choice of graph
databases (most of which use Lucene BTW).
The test data is essentially 130 million edges representing links between pages e.g.  Communism->Russia.
To load the data all of the graph databases have to translate user-defined keys like "Russia"
into an internally-generated node ID using a service that looks like this: 
	interface KeyService
		//Returns existing nodeid or -1 if is not already in store
		public long getGraphNodeId(String udk);

		//Adds a new record - assumption is client has checked user defined key (udk) is not stored
already using getGraphNodeId
		public void put(String udk, long graphNodeId);

This is a challenge on a dataset of this size. I tried using a Lucene-based implementation
for this service with the following optimisations:
1) a Bloomfilter to quickly "know what we don't know"
2) an LRUCache to hold on to commonly referenced vertices e.g the Wikipdedia article for "United
3) a hashmap representing the unflushed state of Lucene's IndexWriter to avoid the need for
excessive flushing with NRT reader etc

The search/write performance showed the familiar saw-toothing as the Lucene index grew in
size and merge operations kicked in.

The KVStore implementation I wrote attempts to tackle this problem using a fundamentally different
form of index. The results from the KVStore runs show it was twice as fast as this  Lucene
solution and maintains constant performance without the saw toothing effect.

Benchmark figures are here:
The KVStore source code is here: and the Lucene implementation I compare
against is also in the project.


To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message