Return-Path: X-Original-To: apmail-lucene-dev-archive@www.apache.org Delivered-To: apmail-lucene-dev-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id D4C1B9849 for ; Mon, 26 Mar 2012 05:54:14 +0000 (UTC) Received: (qmail 66385 invoked by uid 500); 26 Mar 2012 05:54:13 -0000 Delivered-To: apmail-lucene-dev-archive@lucene.apache.org Received: (qmail 66061 invoked by uid 500); 26 Mar 2012 05:54:10 -0000 Mailing-List: contact dev-help@lucene.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@lucene.apache.org Delivered-To: mailing list dev@lucene.apache.org Received: (qmail 66028 invoked by uid 99); 26 Mar 2012 05:54:08 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 26 Mar 2012 05:54:08 +0000 X-ASF-Spam-Status: No, hits=2.5 required=5.0 tests=FREEMAIL_REPLY,HTML_MESSAGE,RCVD_IN_DNSWL_LOW,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (athena.apache.org: domain of joaquin.delgado@gmail.com designates 209.85.214.48 as permitted sender) Received: from [209.85.214.48] (HELO mail-bk0-f48.google.com) (209.85.214.48) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 26 Mar 2012 05:54:04 +0000 Received: by bkcji17 with SMTP id ji17so5672488bkc.35 for ; Sun, 25 Mar 2012 22:53:43 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :content-type; bh=WGoRdKrfdrd5vN80In4d1WnarVNeuqA2KmUx3CkV8DE=; b=PU5aTikxyPArK3W7AC0BnU3Fy+N1c97+xdycqpj/tlJjCcOo8cyju0dzYE2stRO4sP q7QFWtGEqwAUEXB5ivXsEO9k5oSz/JAh8+bZreEuiQvEi1o+vlHNO2kLyTudJb3+NDzW lWI3Vojwfr6HpuOAWWEkMvXVz4EQFZv58iytyW2iyGqackGVPK/JSI94Tm9anLF+NDH7 eAuBfrVN+1dF/i81Xw7yWpTS7AYTZdvTlDR2s05uQtwS2IhePMdKzZBU5p/arPpuaK8I jQEsODROWpzG+XZZQdLM72giCSkk9u/VweI2eAxRBjoRATrI8sNvv2fFmo10Hd8QjNoA Iptw== MIME-Version: 1.0 Received: by 10.204.151.195 with SMTP id d3mr7976042bkw.53.1332741222915; Sun, 25 Mar 2012 22:53:42 -0700 (PDT) Received: by 10.204.9.212 with HTTP; Sun, 25 Mar 2012 22:53:42 -0700 (PDT) In-Reply-To: References: <1332438147.53976.YahooMailNeo@web132202.mail.ird.yahoo.com> <194CC721-CD06-47DA-8B9D-B30AC24F6437@yahoo.co.uk> Date: Sun, 25 Mar 2012 22:53:42 -0700 Message-ID: Subject: Re: Proposal - a high performance Key-Value store based on Lucene APIs/concepts From: "J. Delgado" To: dev@lucene.apache.org Content-Type: multipart/alternative; boundary=0015175cae4e2b5f0004bc1eff52 X-Virus-Checked: Checked by ClamAV on apache.org --0015175cae4e2b5f0004bc1eff52 Content-Type: text/plain; charset=ISO-8859-1 Hi Mark, I'm interested in what you have done in somewhat peculiar way: Currently, we use fields and terms in Lucene as the basis for the inverted index. However, as you can read in this paper for indexing Boolean expressions : http://theory.stanford.edu/~sergei/papers/vldb09-indexing.pdf, they create posting lists for all possible attribute name and value pairs (also called keys) among the conjunctions. A posting list head contains the key (A, v). The keys of the posting lists are stored in a hash table, which will be used to search posting lists given keys of an assignment. So perhaps the ability to mix the two different forms of indexes by building a posting list for each entry in your KVStore may help me design a Lucene-based solution for this problem. -- J On Sat, Mar 24, 2012 at 4:50 PM, Lance Norskog wrote: > Cool! > > On Sat, Mar 24, 2012 at 4:17 PM, Mark Harwood > wrote: > > OK I have some code and benchmarks for this solution up on a Google Code > project here: http://code.google.com/p/graphdb-load-tester/ > > > > The project exists to address the performance challenges I have > encountered when dealing with large graphs. It uses all of the Wikipedia > links as a test dataset and a choice of graph databases (most of which use > Lucene BTW). > > The test data is essentially 130 million edges representing links > between pages e.g. Communism->Russia. > > To load the data all of the graph databases have to translate > user-defined keys like "Russia" into an internally-generated node ID using > a service that looks like this: > > interface KeyService > > { > > //Returns existing nodeid or -1 if is not already in store > > public long getGraphNodeId(String udk); > > > > //Adds a new record - assumption is client has checked > user defined key (udk) is not stored already using getGraphNodeId > > public void put(String udk, long graphNodeId); > > } > > > > This is a challenge on a dataset of this size. I tried using a > Lucene-based implementation for this service with the following > optimisations: > > 1) a Bloomfilter to quickly "know what we don't know" > > 2) an LRUCache to hold on to commonly referenced vertices e.g the > Wikipdedia article for "United States" > > 3) a hashmap representing the unflushed state of Lucene's IndexWriter to > avoid the need for excessive flushing with NRT reader etc > > > > The search/write performance showed the familiar saw-toothing as the > Lucene index grew in size and merge operations kicked in. > > > > The KVStore implementation I wrote attempts to tackle this problem using > a fundamentally different form of index. The results from the KVStore runs > show it was twice as fast as this Lucene solution and maintains constant > performance without the saw toothing effect. > > > > Benchmark figures are here: http://goo.gl/VQ027 > > The KVStore source code is here: http://goo.gl/ovkop and the Lucene > implementation I compare against is also in the project. > > > > Cheers > > Mark > > > > > > > > > > > > --------------------------------------------------------------------- > > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org > > For additional commands, e-mail: dev-help@lucene.apache.org > > > > > > -- > Lance Norskog > goksron@gmail.com > > --------------------------------------------------------------------- > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org > For additional commands, e-mail: dev-help@lucene.apache.org > > --0015175cae4e2b5f0004bc1eff52 Content-Type: text/html; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable Hi Mark,

I'm interested in what you have done in som= ewhat peculiar way:

Currently, we use fields and t= erms in Lucene as the basis for the inverted index. However, as you can rea= d in this paper for indexing Boolean expressions :=A0http://theory.stanford.= edu/~sergei/papers/vldb09-indexing.pdf , they=A0create posting lists fo= r all possible attribute name and value pairs (also=A0called keys) among th= e conjunctions. A posting list head contains=A0the key (A, v).=A0The keys o= f=A0the posting lists are stored in a hash table, which will be used to=A0s= earch posting lists given keys of an assignment.

So perhaps the ability to mix the two different forms o= f indexes by building a posting list for each entry in your KVStore may hel= p me design a Lucene-based solution for this problem.

-- J


On Sat, M= ar 24, 2012 at 4:50 PM, Lance Norskog <goksron@gmail.com> wrote:
Cool!

On Sat, Mar 24, 2012 at 4:17 PM, Mark Harwood <markharw00d@yahoo.co.uk> wrote:
> OK I have some code and benchmarks for this solution up on a Google Co= de project here: http://code.google.com/p/graphdb-load-tester/
>
> The project exists to address the performance challenges I have encoun= tered when dealing with large graphs. It =A0uses all of the Wikipedia links= as a test dataset and a choice of graph databases (most of which use Lucen= e BTW).
> The test data is essentially 130 million edges representing links betw= een pages e.g. =A0Communism->Russia.
> To load the data all of the graph databases have to translate user-def= ined keys like "Russia" into an internally-generated node ID usin= g a service that looks like this:
> =A0 =A0 =A0 =A0interface KeyService
> =A0 =A0 =A0 =A0{
> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0//Returns existing nodeid or -1 if is n= ot already in store
> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0public long getGraphNodeId(String udk);=
>
> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0//Adds a new record - assumption is cli= ent has checked user defined key (udk) is not stored already using getGraph= NodeId
> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0public void put(String udk, long graphN= odeId);
> =A0 =A0 =A0 =A0}
>
> This is a challenge on a dataset of this size. I tried using a Lucene-= based implementation for this service with the following optimisations:
> 1) a Bloomfilter to quickly "know what we don't know" > 2) an LRUCache to hold on to commonly referenced vertices e.g the Wiki= pdedia article for "United States"
> 3) a hashmap representing the unflushed state of Lucene's IndexWri= ter to avoid the need for excessive flushing with NRT reader etc
>
> The search/write performance showed the familiar saw-toothing as the L= ucene index grew in size and merge operations kicked in.
>
> The KVStore implementation I wrote attempts to tackle this problem usi= ng a fundamentally different form of index. The results from the KVStore ru= ns show it was twice as fast as this =A0Lucene solution and maintains const= ant performance without the saw toothing effect.
>
> Benchmark figures are here: http://goo.gl/VQ027
> The KVStore source code is here: http://goo.gl/ovkop and the Lucene implementation I compar= e against is also in the project.
>
> Cheers
> Mark
>
>
>
>
>
> ---------------------------------------------------------------------<= br> > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>



--
Lance Norskog
goksron@gmail.com

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


--0015175cae4e2b5f0004bc1eff52--