Mailing-List: contact java-dev-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-dev@lucene.apache.org
MIME-Version: 1.0
In-Reply-To: <c68e39170908300608p612d97fao643e056eed30e4a4@mail.gmail.com>
References: <4A9A1AC8.7090805@gmail.com>
	 <c68e39170908300608p612d97fao643e056eed30e4a4@mail.gmail.com>
Date: Mon, 31 Aug 2009 00:22:40 -0700
Message-ID: <f74418b30908310022y4ac4ed1fpf204ef3829cfca5c@mail.gmail.com>
Subject: Re: Parallel incremental indexing
From: Michael Busch <buschmic@gmail.com>
To: java-dev@lucene.apache.org, yonik@lucidimagination.com
Content-Type: multipart/alternative; boundary=000e0cd4866c2de80e04726ae61d

--000e0cd4866c2de80e04726ae61d
Content-Type: text/plain; charset=ISO-8859-1

On Sun, Aug 30, 2009 at 6:08 AM, Yonik Seeley <yonik@lucidimagination.com> wrote:

> Cool stuff!
>
>
Thanks. It's actually really fun to work on! Once I had parallel indexing
working and no longer had to worry about how to manage the parallel
indexes, the fun of implementing cool features on top of it started. I hope
you'll have that fun in Solr too! :)


> We should also think about how to do single document field updates or
> field adds since that is the most common use case - not that it needs
>

I completely agree that we should solve that problem too.


> to be implemented in the first version, but kept in mind so we don't
> box ourselves in.
>

This code is currently non-intrusive from Lucene's point of view (it can't
be intrusive, because I use it on top of vanilla 2.4.1). But I agree: when we
integrate it more tightly into Lucene to make it more efficient, we should
keep the end goal in mind (e.g. the use cases you mentioned).


>
> Doug mentioned some ideas he had in passing almost a year ago about
> how to add a field to a single document, and it is similar in that it
> used parallel reader.  IndexWriter would be modified to maintain the
> same structure across parallel indexes, as you note.  If one wanted to
> add a new field value to document 1000, one would have to index dummy
> documents for docs 0-999... instead of this, the index format should
> support gaps.  On a segment merge, the IndexWriter could simply merge
> in this new segment.
>
>
Yeah, currently it's kind of inefficient that we have to call addDocument()
1000 times with an empty document (once for each of docs 0-999) to achieve
this. The .frq and .prx files, however, cope well with gaps because they use
delta encoding, and .del files support DGaps now. On the other hand, the
stored fields index (.fdx) in particular doesn't support gaps, because it has
to allow random access by document number. Norm files and term vectors
(though both can be turned off) don't support gaps either.
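
To make the gap argument concrete, here is a minimal Java sketch; it is not
Lucene's actual file-format code. It assumes postings are stored as
variable-length deltas between doc IDs (the real .frq format also folds term
frequencies into those deltas) and models a fixed-width index such as .fdx as
one 8-byte slot per document. The names GapDemo, encodePostings and
fixedWidthIndexBytes are made up for illustration.

```java
import java.io.ByteArrayOutputStream;

public class GapDemo {
    // Write an int as a variable-length quantity: 7 payload bits per byte,
    // with the high bit set on every byte except the last one.
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    // Delta-encode a sorted list of doc IDs: each ID is stored as the gap
    // to the previous ID, so skipped documents cost nothing extra.
    static byte[] encodePostings(int[] sortedDocIds) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int previous = 0;
        for (int docId : sortedDocIds) {
            writeVInt(out, docId - previous);
            previous = docId;
        }
        return out.toByteArray();
    }

    // Assumed model of a fixed-width index: one 8-byte slot for every
    // document from 0 up to and including maxDocId, used or not.
    static long fixedWidthIndexBytes(int maxDocId) {
        return 8L * (maxDocId + 1);
    }

    public static void main(String[] args) {
        byte[] postings = encodePostings(new int[] {1000});
        System.out.println("delta-encoded postings: " + postings.length + " bytes");
        System.out.println("fixed-width index:      " + fixedWidthIndexBytes(1000) + " bytes");
    }
}
```

For a lone posting at doc 1000, the delta-encoded list takes 2 bytes, while
the fixed-width index needs 8008 bytes for slots 0-1000, nearly all of them
placeholders.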


> Anyway, updateable documents is fundamental enough, we should also
> consider changes to the index format if it makes it easier.
>
>
Yes, I agree. We should make changes to the default index format if that
makes updating documents more efficient. Note that I said "default index
format" :) - I'm already excited about having parallel indexing and flexible
indexing in Lucene. It will be awesome what you can then do with Lucene!

So I think we should start with the work needed to keep parallel indexes
in sync. When that's done, we should continue with the use cases we discussed,
including changing the index format to support gaps.
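
As a rough illustration of what "in sync" means here, the following Java
sketch models two parallel indexes as plain lists in which the list position
plays the role of the doc ID. It is a toy model under that assumption, not
Lucene code: ToyIndex, ParallelIndexSketch and their methods are hypothetical
names, and the in-place set() merely stands in for whatever mechanism
regenerates the small index, since real Lucene segments are write-once.

```java
import java.util.ArrayList;
import java.util.List;

public class ParallelIndexSketch {
    // A toy "index": one value per document, position in the list = doc ID.
    static final class ToyIndex {
        final List<String> values = new ArrayList<>();
        int maxDoc() { return values.size(); }
    }

    final ToyIndex large = new ToyIndex(); // rarely changing fields, e.g. body text
    final ToyIndex small = new ToyIndex(); // frequently updated fields, e.g. tags

    // Every add goes to BOTH indexes, so doc IDs stay aligned.
    int addDocument(String body, String tags) {
        large.values.add(body);
        small.values.add(tags);
        checkInSync();
        return large.maxDoc() - 1;
    }

    // A partial update touches only the small index; the large one is untouched.
    void updateTags(int docId, String tags) {
        small.values.set(docId, tags);
        checkInSync();
    }

    // The invariant a parallel reader relies on: same number of documents,
    // and doc n in one index describes the same document as doc n in the other.
    void checkInSync() {
        if (large.maxDoc() != small.maxDoc()) {
            throw new IllegalStateException("parallel indexes out of sync");
        }
    }

    // Reading a document combines the fields of both indexes by doc ID.
    String document(int docId) {
        return large.values.get(docId) + " | " + small.values.get(docId);
    }

    public static void main(String[] args) {
        ParallelIndexSketch index = new ParallelIndexSketch();
        int id = index.addDocument("long body text", "draft");
        index.updateTags(id, "published");
        System.out.println(index.document(id));
    }
}
```

The point of the split is that updateTags() never rewrites the large index;
the cost is that both indexes must always agree on document numbering.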


> -Yonik
> http://www.lucidimagination.com
>
>
> On Sun, Aug 30, 2009 at 2:23 AM, Michael Busch <buschmic@gmail.com> wrote:
> > Hi all,
> >
> > I just added a wiki page for a new feature I'd like to add to
> > Lucene. Please take a look at the link. I will add more details and
> > diagrams to the page, but for now it should give a rough idea about
> > how to implement it:
> >
> > http://wiki.apache.org/lucene-java/ParallelIncrementalIndexing
> >
> > Basically the idea is to allow updating documents partially, e.g. only
> > a subset of the fields without having to reindex the entire
> > document. This is a feature that is very often asked for.
> >
> > We have implemented the solution at IBM and it's working
> > great. It has already allowed us to add really exciting new
> > features to products that weren't easily possible before.
> >
> > The implementation I can currently contribute has some limitations:
> > e.g. multi-threaded indexing is not supported. But let me make clear
> > that this is not a limitation of the design described in the wiki - we
> > have these limitations because we implemented this on top of Lucene's 2.4
> > APIs. If we decide to add this to Lucene's core we should
> > reimplement some parts to overcome those limitations.
> >
> > In my opinion this will be a great addition to Lucene that many
> > people will find very useful. In Solr this is also something users often
> > ask for.
> >
> > In the last weeks I worked on getting internal approval for the
> > contribution to Lucene and the good news is that I already have a signed
> > software grant ready - so if the community likes this feature and
> > decides to add this to Lucene there won't be any delay for legal work
> > from IBM's side.
> >
> > Btw: I will be on vacation from 09/03-09/20 and won't have internet
> > access most of the time, so if I stop responding end of next week you'll
> > know why...
> >
> > Please let me know what you think!
> >
> >  Michael
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-dev-help@lucene.apache.org
> >
> >

--000e0cd4866c2de80e04726ae61d--