Mailing-List: contact dev-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@lucene.apache.org
Received-SPF: pass (athena.apache.org: domain of joaquin.delgado@gmail.com
 designates 209.85.214.48 as permitted sender)
MIME-Version: 1.0
In-Reply-To: 
 <CAK9GPcSVdnhSDSDVWpmFZrucQ_RCSwSYj4kG1hn_-1e0w-DhfQ@mail.gmail.com>
References: <1332438147.53976.YahooMailNeo@web132202.mail.ird.yahoo.com>
	<194CC721-CD06-47DA-8B9D-B30AC24F6437@yahoo.co.uk>
	<CAK9GPcSVdnhSDSDVWpmFZrucQ_RCSwSYj4kG1hn_-1e0w-DhfQ@mail.gmail.com>
Date: Sun, 25 Mar 2012 22:53:42 -0700
Message-ID: 
 <CAF2TqXh3ebjxcEn1NTxOyGzXjPwXdNKBMiD18abx5zbhAcsU8g@mail.gmail.com>
Subject: Re: Proposal - a high performance Key-Value store based on Lucene
 APIs/concepts
From: "J. Delgado" <joaquin.delgado@gmail.com>
To: dev@lucene.apache.org
Content-Type: multipart/alternative; boundary=0015175cae4e2b5f0004bc1eff52

--0015175cae4e2b5f0004bc1eff52
Content-Type: text/plain; charset=ISO-8859-1

Hi Mark,

I'm interested in what you have done, for a somewhat peculiar reason:

Currently, we use fields and terms in Lucene as the basis for the inverted
index. However, as you can read in this paper on indexing Boolean
expressions (http://theory.stanford.edu/~sergei/papers/vldb09-indexing.pdf),
the authors create posting lists for all possible attribute name and value
pairs (also called keys) among the conjunctions. The head of each posting
list contains the key (A, v). The keys of the posting lists are stored in a
hash table, which is then used to look up the posting lists for the keys of
an assignment.
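
To make the idea concrete, here is a minimal sketch in Java of a hash table of posting lists keyed by (attribute, value) pairs. It is only an illustration under my own assumptions: the class and method names (KeyPostingIndex, addConjunction, match) are hypothetical, it handles equality predicates only, and it does not attempt to reproduce the paper's full algorithm.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical illustration: one posting list per (attribute, value) key.
class KeyPostingIndex {
    // Hash table from key (A, v) to the IDs of conjunctions containing A = v,
    // kept in insertion order.
    private final Map<SimpleImmutableEntry<String, String>, List<Integer>> postings =
            new HashMap<>();
    // Number of predicates in each conjunction, needed to decide a full match.
    private final Map<Integer, Integer> conjunctionSize = new HashMap<>();

    // Index one conjunction given as attribute -> required value.
    void addConjunction(int id, Map<String, String> predicates) {
        conjunctionSize.put(id, predicates.size());
        for (Map.Entry<String, String> p : predicates.entrySet()) {
            postings.computeIfAbsent(
                    new SimpleImmutableEntry<>(p.getKey(), p.getValue()),
                    k -> new ArrayList<>()).add(id);
        }
    }

    // Return the conjunctions whose predicates are all satisfied by the assignment.
    TreeSet<Integer> match(Map<String, String> assignment) {
        Map<Integer, Integer> hits = new HashMap<>();
        for (Map.Entry<String, String> a : assignment.entrySet()) {
            List<Integer> list =
                    postings.get(new SimpleImmutableEntry<>(a.getKey(), a.getValue()));
            if (list == null) {
                continue; // no conjunction mentions this key
            }
            for (int id : list) {
                hits.merge(id, 1, Integer::sum);
            }
        }
        TreeSet<Integer> matched = new TreeSet<>();
        for (Map.Entry<Integer, Integer> h : hits.entrySet()) {
            // A conjunction matches when every one of its keys was found.
            if (h.getValue().equals(conjunctionSize.get(h.getKey()))) {
                matched.add(h.getKey());
            }
        }
        return matched;
    }
}
```

The point of the sketch is that the lookup structure is a plain key-to-list map, which is why a fast key-value store looks like a plausible home for the posting-list heads.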

So perhaps the ability to mix the two different forms of indexes by
building a posting list for each entry in your KVStore may help me design a
Lucene-based solution for this problem.

-- J


On Sat, Mar 24, 2012 at 4:50 PM, Lance Norskog <goksron@gmail.com> wrote:

> Cool!
>
> On Sat, Mar 24, 2012 at 4:17 PM, Mark Harwood <markharw00d@yahoo.co.uk>
> wrote:
> > OK I have some code and benchmarks for this solution up on a Google Code
> project here: http://code.google.com/p/graphdb-load-tester/
> >
> > The project exists to address the performance challenges I have
> encountered when dealing with large graphs. It uses all of the Wikipedia
> links as a test dataset and a choice of graph databases (most of which use
> Lucene BTW).
> > The test data is essentially 130 million edges representing links
> between pages e.g. Communism->Russia.
> > To load the data all of the graph databases have to translate
> user-defined keys like "Russia" into an internally-generated node ID using
> a service that looks like this:
> >        interface KeyService
> >        {
> >                // Returns the existing node ID, or -1 if the key is not already in the store
> >                public long getGraphNodeId(String udk);
> >
> >                // Adds a new record. Assumes the client has already used getGraphNodeId
> >                // to check that the user-defined key (udk) is not stored
> >                public void put(String udk, long graphNodeId);
> >        }
> >
> > This is a challenge on a dataset of this size. I tried using a
> Lucene-based implementation for this service with the following
> optimisations:
> > 1) a Bloom filter to quickly "know what we don't know"
> > 2) an LRUCache to hold on to commonly referenced vertices, e.g. the
> Wikipedia article for "United States"
> > 3) a hashmap representing the unflushed state of Lucene's IndexWriter to
> avoid the need for excessive flushing with NRT reader etc
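
The three optimisations listed above can be sketched as layers in front of a slow store. This is a rough sketch under stated assumptions, not Mark's actual code: LayeredKeyService, MapBackedStore and flushThreshold are hypothetical names, the in-memory MapBackedStore merely stands in for the Lucene index (no Lucene calls are made), and the Bloom filter is a deliberately simple two-hash BitSet.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

interface KeyService {
    long getGraphNodeId(String udk);          // -1 if the key is not stored
    void put(String udk, long graphNodeId);
}

// Hypothetical stand-in for the slow, index-backed store.
class MapBackedStore implements KeyService {
    private final Map<String, Long> data = new HashMap<>();
    int lookups = 0;                          // counts how often the slow path is hit

    public long getGraphNodeId(String udk) {
        lookups++;
        Long id = data.get(udk);
        return id == null ? -1 : id;
    }

    public void put(String udk, long graphNodeId) {
        data.put(udk, graphNodeId);
    }
}

class LayeredKeyService implements KeyService {
    private final KeyService backing;
    private final BitSet bloom;
    private final int bloomBits;
    private final Map<String, Long> lru;
    private final Map<String, Long> pending = new HashMap<>(); // unflushed writes
    private final int flushThreshold;

    LayeredKeyService(KeyService backing, int bloomBits, int lruCapacity, int flushThreshold) {
        this.backing = backing;
        this.bloomBits = bloomBits;
        this.bloom = new BitSet(bloomBits);
        this.flushThreshold = flushThreshold;
        // Access-ordered LinkedHashMap that evicts the least recently used entry.
        this.lru = new LinkedHashMap<String, Long>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Long> eldest) {
                return size() > lruCapacity;
            }
        };
    }

    private int slot(String udk, int seed) {
        int h = udk.hashCode() * 31 + seed * 0x9E3779B9;
        h ^= (h >>> 16);
        return Math.floorMod(h, bloomBits);
    }

    public long getGraphNodeId(String udk) {
        // 1) Bloom filter: a clear bit means the key was never added.
        if (!bloom.get(slot(udk, 1)) || !bloom.get(slot(udk, 2))) {
            return -1;
        }
        // 2) LRU cache of commonly referenced keys.
        Long id = lru.get(udk);
        if (id != null) {
            return id;
        }
        // 3) Writes that have not reached the backing store yet.
        id = pending.get(udk);
        if (id != null) {
            return id;
        }
        long stored = backing.getGraphNodeId(udk);
        if (stored != -1) {
            lru.put(udk, stored);
        }
        return stored;
    }

    public void put(String udk, long graphNodeId) {
        bloom.set(slot(udk, 1));
        bloom.set(slot(udk, 2));
        pending.put(udk, graphNodeId);
        lru.put(udk, graphNodeId);
        if (pending.size() >= flushThreshold) {
            flush();
        }
    }

    // Push buffered writes to the backing store in one batch.
    void flush() {
        for (Map.Entry<String, Long> e : pending.entrySet()) {
            backing.put(e.getKey(), e.getValue());
        }
        pending.clear();
    }
}
```

With this layering, a lookup for a key that was never added returns -1 without touching the backing store, and recent writes are visible before they are flushed.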
> >
> > The search/write performance showed the familiar saw-toothing as the
> Lucene index grew in size and merge operations kicked in.
> >
> > The KVStore implementation I wrote attempts to tackle this problem using
> a fundamentally different form of index. The results from the KVStore runs
> show it was twice as fast as this Lucene solution and maintained constant
> performance without the saw-toothing effect.
> >
> > Benchmark figures are here: http://goo.gl/VQ027
> > The KVStore source code is here: http://goo.gl/ovkop and the Lucene
> implementation I compare against is also in the project.
> >
> > Cheers
> > Mark
> >
> >
> >
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: dev-help@lucene.apache.org
> >
>
>
>
> --
> Lance Norskog
> goksron@gmail.com
>
>
>

--0015175cae4e2b5f0004bc1eff52--